how do I print parts of regex - python

I cannot print components of matched regex.
I am learning python3 and I need to verify that output of my command matches my needs. I have following short code:
#!/usr/bin/python3
import re
text_to_search = '''
1 | 27 23 8 |
2 | 21 23 8 |
3 | 21 23 8 |
4 | 21 21 21 |
5 | 21 21 21 |
6 | 27 27 27 |
7 | 27 27 27 |
'''
pattern = re.compile('(.*\n)*( \d \| 2[17] 2[137] [ 2][178] \|)')
matches = pattern.finditer(text_to_search)
for match in matches:
    print (match)
    print ()
    print ('matched to group 0:' + match.group(0))
    print ()
    print ('matched to group 1:' + match.group(1))
    print ()
    print ('matched to group 2:' + match.group(2))
and following output:
<_sre.SRE_Match object; span=(0, 140), match='\n 1 | 27 23 8 |\n 2 | 21 23 8 |\n 3 >
matched to group 0:
1 | 27 23 8 |
2 | 21 23 8 |
3 | 21 23 8 |
4 | 21 21 21 |
5 | 21 21 21 |
6 | 27 27 27 |
7 | 27 27 27 |
matched to group 1: 6 | 27 27 27 |
matched to group 2: 7 | 27 27 27 |
Please explain:
1) Why does "print (match)" print only the beginning of the match? Does it trim the output once it exceeds some threshold?
2) Why is group(1) printed as "6 | 27 27 27 |"? I was hoping (.*\n)* would be as greedy as possible and consume lines 1-6, leaving the last line of text_to_search to be matched by group(2), but it seems (.*\n)* captured only the 6th line. Why are lines 1-5 not printed when printing group(1)?
3) I went through a regex tutorial but failed to understand those tricks with (?...). How do I verify that the numbers in the last row are equal (so 27 27 27 is OK, but 21 27 27 is not)?

1) print(match) only shows a summary of the object: the repr of a match object gives the span and a truncated preview of the matched text. To get the full text, call match.group(0), which returns the whole match.
2) To capture lines 1-6, change (.*\n)* to ((?:.*\n)*). As this regex tester puts it:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
So (.*\n)* did consume lines 1-6, but group(1) keeps only the text of its last repetition, which is line 6.
3) To match specific numbers, make the pattern more specific and put those numbers in a separate group at the end; a backreference such as \1 can then require the later numbers to repeat the captured one.
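A minimal sketch of points 2 and 3, using a shortened sample rather than the full text from the question. The names sample, ROW_EQUAL and last_row_equal are made up for illustration, and checking only the final non-empty line is an assumption about what "last row" means here.

```python
import re

sample = "1 | 27 23 8 |\n2 | 21 23 8 |\n6 | 27 27 27 |\n7 | 27 27 27 |"

# A repeated capturing group keeps only its last iteration.
last_only = re.match(r'(.*\n)*(.*)', sample)
print(repr(last_only.group(1)))   # '6 | 27 27 27 |\n'

# One capturing group around a non-capturing repeat keeps every iteration.
all_lines = re.match(r'((?:.*\n)*)(.*)', sample)
print(all_lines.group(1).count("\n"))   # 3

# \1 refers back to the first captured number, so all three must be equal.
ROW_EQUAL = re.compile(r'\s*\d+ \| (\d+) \1 \1 \|\s*')

def last_row_equal(text):
    """Return True when the last non-empty row repeats one number three times."""
    rows = [line for line in text.splitlines() if line.strip()]
    return bool(rows) and ROW_EQUAL.fullmatch(rows[-1]) is not None

print(last_row_equal(sample))             # True
print(last_row_equal("7 | 21 27 27 |"))   # False
```

Checking the last line on its own avoids the case where the full pattern backtracks and silently accepts an earlier row instead.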

Related

How can you pre-process a pyspark.read query to remove certain text and characters like \n

Suppose we have a dataset as follows:
NAME | AGE \n
"Greg" | 25 \n
"Frank" | 33 \n
"Scotty" | \n
26 | \n
"Dave" | 44
If we read the data in using
pyspark.read.csv(data_location)
We would get an erroneous DataFrame, and at that point it can no longer be edited. Note the 26 under NAME.
NAME   | AGE
Greg   | 25
Frank  | 33
Scotty |
26     |
Dave   | 44
How would we go about correcting this table in a reproducible manner? Once the data is read into the DataFrame I don't think I can really do much anymore.
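One possible direction, sketched in plain Python rather than Spark: repair the raw text before it is parsed. This is only an illustration under strong assumptions: every logical row has exactly two non-empty fields, "|" is the delimiter, and a row that comes up short was split across physical lines. The helper name repair_rows is hypothetical.

```python
def repair_rows(lines, sep="|", width=2):
    """Merge physical lines until each logical row has `width` non-empty fields."""
    rows, pending = [], []
    for line in lines:
        # Keep only the non-empty fields of this physical line, without quotes.
        pending += [f.strip().strip('"') for f in line.split(sep) if f.strip()]
        if len(pending) >= width:
            rows.append(pending[:width])
            pending = []
    return rows

raw = ['NAME | AGE', '"Greg" | 25', '"Frank" | 33', '"Scotty" |', '26 |', '"Dave" | 44']
rows = repair_rows(raw)
print(rows[3])   # ['Scotty', '26']
```

The repaired rows could then be written back out as a clean file, or passed to something like spark.createDataFrame(rows[1:], rows[0]). If empty values are legitimate in the real data, this heuristic would merge rows it should not.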

Rotate list for every letter in string

I am attempting to make a Caesar cipher that changes the key for each letter. I currently have a working cipher that scrambles the entire string with a single key, running through 1-25, but I would like it to use a different shift per letter: the string "ABC" would shift A by 1, B by 2 and C by 3, resulting in BDF.
I already have a working cipher and am just not sure how to make it change for each letter.
import collections
import string

def caesar(rotate_string, number_to_rotate_by):
    upper = collections.deque(string.ascii_uppercase)
    lower = collections.deque(string.ascii_lowercase)
    upper.rotate(number_to_rotate_by)
    lower.rotate(number_to_rotate_by)
    upper = ''.join(list(upper))
    lower = ''.join(list(lower))
    return rotate_string.translate(str.maketrans(string.ascii_uppercase, upper)).translate(str.maketrans(string.ascii_lowercase, lower))

#print (caesar("This is simple", 2))
our_string = "ABC"
for i in range(len(string.ascii_uppercase)):
    print (i, "|", caesar(our_string, i))
Outcome is this:
0 | ABC
1 | ZAB
2 | YZA
3 | XYZ
4 | WXY
5 | VWX
6 | UVW
7 | TUV
8 | STU
9 | RST
10 | QRS
11 | PQR
12 | OPQ
13 | NOP
14 | MNO
15 | LMN
16 | KLM
17 | JKL
18 | IJK
19 | HIJ
20 | GHI
21 | FGH
22 | EFG
23 | DEF
24 | CDE
25 | BCD
What I would like is to have it a shift of 1 or 0 for the first letter, then 2 for the second, and so on.
Good effort! Note that the mapping is not just a rearrangement of the alphabet (several letters end up on the same target), so it can never be achieved by rotating the alphabet. In your example, upper would become the following mapping:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
BDFHJLNPRTVXZBDFHJLNPRTVXZ
Also note this cipher is not easily reversible, i.e. it's not clear whether to reverse 'B'->'A' or 'B'->'N'.
(Side note: If we treat letters ZABCDEFGHIJKLMNOPQRSTUVWXY as numbers 0-25, this cipher multiplies by two modulo 26: (x*2)%26. If instead of 2 we multiply by any number divisible by neither 2 nor 13, the resulting cipher will always be reversible. Can you see why? Hints: [1], [2].)
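One way to convince yourself of the side note is a brute-force check. The sketch below is only an illustration (the helper name is_reversible is hypothetical): it tests which multipliers send the 26 residues to 26 distinct values and compares them with the multipliers that share no factor with 26.

```python
from math import gcd

def is_reversible(k, n=26):
    """True when x -> (x * k) % n sends the n residues to n distinct values."""
    return len({(x * k) % n for x in range(n)}) == n

reversible = [k for k in range(1, 26) if is_reversible(k)]
coprime = [k for k in range(1, 26) if gcd(k, 26) == 1]

print(reversible)             # [1, 3, 5, 7, 9, 11, 15, 17, 19, 21, 23, 25]
print(reversible == coprime)  # True
```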
When you feel confused about a piece of code, often it's a good sign it's time to refactor a part of it into a separate function, e.g. like this:
(Playground: https://ideone.com/wNSADR)
import string

def letter_index(letter):
    """Determines the position of the given letter in the English alphabet
    'a' -> 0
    'A' -> 0
    'z' -> 25
    """
    if letter not in string.ascii_letters:
        raise ValueError("The argument must be an English letter")
    if letter in string.ascii_lowercase:
        return ord(letter) - ord('a')
    return ord(letter) - ord('A')

def caesar(s):
    """Ciphers the string s by shifting 'A'->'B', 'B'->'D', 'C'->'F', etc
    The shift is cyclic, i.e. 'A' comes after 'Z'.
    """
    ret = ""
    for letter in s:
        index = letter_index(letter)
        new_index = 2*index + 1
        if new_index >= len(string.ascii_lowercase):
            # The letter is shifted farther than 'Z'
            new_index %= len(string.ascii_lowercase)
        new_letter = chr(ord(letter) - index + new_index)
        ret += new_letter
    return ret

print('caesar("ABC"):', caesar("ABC"))
print('caesar("abc"):', caesar("abc"))
print('caesar("XYZ"):', caesar("XYZ"))
Output:
caesar("ABC"): BDF
caesar("abc"): bdf
caesar("XYZ"): VXZ
Resources:
chr
ord
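The question can also be read as shifting each letter by its position in the string (first letter by 1, second by 2, and so on) rather than by its index in the alphabet. A minimal sketch of that reading follows; the function name positional_caesar and the choice to pass non-letters through without advancing the shift are assumptions, not something taken from the question.

```python
import string

def positional_caesar(text, start=1):
    """Shift the first letter by `start`, the next letter by start + 1, and so on."""
    out = []
    shift = start
    for ch in text:
        if ch in string.ascii_uppercase:
            base = ord('A')
        elif ch in string.ascii_lowercase:
            base = ord('a')
        else:
            # Non-letters are copied unchanged and do not advance the shift.
            out.append(ch)
            continue
        out.append(chr(base + (ord(ch) - base + shift) % 26))
        shift += 1
    return ''.join(out)

print(positional_caesar("ABC"))   # BDF
print(positional_caesar("XYZ"))   # YAC
```

Unlike the index-based mapping, this version is reversible: applying the same shifts in the opposite direction recovers the input.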

Python Grouping column values into one value

Hi all so using this past link:
I am trying to consolidate columns of values into rows using groupby:
hp = hp[hp.columns[:]].groupby('LC_REF').apply(lambda x: ','.join(x.dropna().astype(str)))
#what I have
22     | 23     | 24      | LC_REF
TV     | WATCH  | HELLO   | 2C16
SCREEN | SOCCER | WORLD   | 2C16
TEST   | HELP   | RED     | 2C17
SEND   | PLEASE | PARFAIT | 2C17
#desired output
22 | TV,SCREEN
23 | WATCH, SOCCER
24 | HELLO, WORLD
25 | TEST, SEND
26 | HELP,PLEASE
27 | RED, PARFAIT
Or some sort of variation where column 22,23,24 is combined and grouped by LC_REF. My current code turns all of column 22 into one row, all of column 23 into one row, etc. I am so close I can feel it!! Any help is appreciated
It seems you need:
df = (hp.groupby('LC_REF')
        .agg(lambda x: ','.join(x.dropna().astype(str)))
        .stack()
        .rename_axis(('LC_REF','a'))
        .reset_index(name='vals'))
print (df)
LC_REF a vals
0 2C16 22 TV,SCREEN
1 2C16 23 WATCH,SOCCER
2 2C16 24 HELLO,WORLD
3 2C17 22 TEST,SEND
4 2C17 23 HELP,PLEASE
5 2C17 24 RED,PARFAIT

Trying to split out a character from a bash "list"

So I have this command that runs the following report in your shell;
Command output
Available Reports for: isl-01-chi Time Zone: CDT
================================================================================
|ID |FSA Job Start |FSA Job End |Size |
================================================================================
|313 |Aug 21 2016, 10:00 PM |Aug 22 2016, 12:33 AM |1.040G |
--------------------------------------------------------------------------------
|318 |Aug 22 2016, 10:00 PM |Aug 23 2016, 12:35 AM |1.039G |
--------------------------------------------------------------------------------
|323 |Aug 23 2016, 10:00 PM |Aug 24 2016, 12:34 AM |1.045G |
--------------------------------------------------------------------------------
|328 |Aug 24 2016, 10:00 PM |Aug 25 2016, 12:35 AM |1.043G |
--------------------------------------------------------------------------------
|333 |Aug 25 2016, 10:00 PM |Aug 26 2016, 12:57 AM |1.057G |
--------------------------------------------------------------------------------
|339 |Aug 26 2016, 10:00 PM |Aug 27 2016, 03:01 AM |2.183G |
--------------------------------------------------------------------------------
|346 |Aug 28 2016, 07:24 AM |Aug 28 2016, 11:53 AM |2.183G |
--------------------------------------------------------------------------------
|351 |Aug 28 2016, 10:00 PM |Aug 29 2016, 02:37 AM |2.182G |
================================================================================
What I'm looking to do is find the latest ID (the greatest number), and I was wondering what the easiest way of doing this in Python is.
How about the following, assuming the first data row sits at index 5 of output.split("\n") (adjust the start index if your captured output has a different number of leading lines):
largest_id = max(int(line.split()[0][1:]) for line in output.split("\n")[5::2])
If the output is always sorted and the last data row is the second-to-last line, then:
largest_id = int(output.split('\n')[-2].split()[0][1:])
Spelled out step by step:
lines = output.split('\n')
second_to_last_line = lines[-2]
split_by_whitespace = second_to_last_line.split()
first_non_whitespace_blob = split_by_whitespace[0]
id_string_ignoring_the_column_char = first_non_whitespace_blob[1:]
largest_id = int(id_string_ignoring_the_column_char)
If the output is always sorted, then read each line into a list, get the second to last entry in the list, and discard the rest.
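Whether or not the report is sorted, the ID column can also be picked out without relying on line positions. The sketch below is one way to do that, assuming data rows look like '|313 |Aug 21 ...' as in the question; the function name latest_report_id and the shortened sample are made up for illustration.

```python
def latest_report_id(output):
    """Return the largest numeric ID among rows whose first cell is all digits."""
    ids = []
    for line in output.splitlines():
        cells = line.split('|')
        # Data rows start with '|', so cells[0] is empty and cells[1] holds the ID.
        if len(cells) > 1 and cells[1].strip().isdigit():
            ids.append(int(cells[1]))
    return max(ids) if ids else None

sample = """Available Reports for: isl-01-chi Time Zone: CDT
====================
|ID |FSA Job Start |FSA Job End |Size |
====================
|313 |Aug 21 2016, 10:00 PM |Aug 22 2016, 12:33 AM |1.040G |
--------------------
|351 |Aug 28 2016, 10:00 PM |Aug 29 2016, 02:37 AM |2.182G |
===================="""

print(latest_report_id(sample))   # 351
```

Comparing the IDs as integers matters once they stop having the same number of digits, because string comparison would rank '99' above '351'.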
I think it is better to take the value in the shell:
your_command | grep -Eo '^\|[0-9]+' | cut -d "|" -f2 | sort -n | tail -n1
I'd use a regex because it looks like the ID is a number of 3 (or more) digits preceded by the pipe (|) character.
This should do it:
import re
regex = re.compile(r'(?<=\|)\d{3,}')
m = regex.findall(text)
max(map(int, m))
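data = "..."  # load your shell output here
output_line = ""
max_id = 0
for a in data.split("\n"):
    fields = a.split()
    # Skip blank lines, separators and headers: data rows start with '|<digits>'
    if not fields or not fields[0][1:].isdigit():
        continue
    current_id = int(fields[0][1:])
    if current_id > max_id:
        output_line = a
        max_id = current_id
print(max_id, output_line)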
You can also do it in the shell with awk and sort:
your_command | awk -F'|' '$2 ~ /^[0-9]+ *$/ {print $2+0}' | sort -n | tail -n1

Python replace integers by integers only at particular location

Hi, I have a file which contains the data shown below. I want to replace the integers that occur after 'A' (fourth column), namely 2, 3, 15, 25, 115 and 1215, with other integers that I have in a dictionary (key, value). The number of white spaces after 'A' ranges from 0 to 3. I tried the str.replace(old, new) method in Python, but it replaces every instance of the integers in the file.
This is the replacement I want to do inside the file.
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56
Suggest me some ways to do it.
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
s="""Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56"""
res = []
for line in s.splitlines():
    spl = line.split()
    if len(spl) == 8:
        ints = map(int,spl[-3:])
        res.append(" ".join(spl[:-3]+[str(replacements.get(k, str(k))) for k in ints]))
    else:
        spl[-3] = spl[-3].replace("A","")
        ints = map(int,spl[-3:])
        res.append(" ".join(spl[:-3]+["A"]+[str(replacements.get(k, str(k))) for k in ints]))
print(res)
['Name 1 N ASHA A 0 35 23', 'Name 2 R MONA A 5 30 56', 'Name 3 P TERY A 7 23 32', 'Name 4 Q JACK A 30 56 30', 'Name 5 D TOM A 120 57 45', 'Name 3 P SEN A 1220 45 56']
I'm not sure whether you want to use the data or write it to a file, but if your file is like your example, this will replace the numbers using the dict. If the length of the split differs, we know a number is attached to the A without a space, so we strip the A first and add it back.
The output will also always have a space after the A, so if you write it to a file and have to work on the file again, it will be a lot easier.
I would just remove the map and use strings as keys and values unless you actually want ints.
If you want to keep the exact same format and only want to change the first number:
replacements = {"2":"0","3":"5","15":"7","25":"30","115":"120","1215":"1220"}
s="""Name 1 N ASHA A 2 35 23
Name 2 R MONA A 3 25 56
Name 3 P TERY A 15 23 32
Name 4 Q JACK A 25 56 25
Name 5 D TOM A 115 57 45
Name 3 P SEN A1215 45 56"""
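res = []
for line in s.splitlines():
    spl = line.rsplit(None, 3)
    head, first = spl[0], spl[1]
    # Drop the attached "A" when there is no space between it and the number
    k = first[1:] if first[0] == "A" else first
    # Replace only the first occurrence after the head; a plain line.replace(k, ...)
    # would also rewrite matching digits elsewhere in the line
    tail = line[len(head):]
    res.append(head + tail.replace(k, replacements.get(k, k), 1))
print(res)
['Name 1 N ASHA A 0 35 23', 'Name 2 R MONA A 5 25 56', 'Name 3 P TERY A 7 23 32', 'Name 4 Q JACK A 30 56 25', 'Name 5 D TOM A 120 57 45', 'Name 3 P SEN A1220 45 56']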
Edited based on additional info regarding all other numbers.
This is entirely dependent on the specific characteristics of your file that you mention in your comments.
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
with open('input.txt', 'r') as fin, open('output.txt', 'w') as fout:
    pos_a = 22 # 0-indexed position of 'A' in every line
    for line in fin:
        left_side = line[:pos_a + 1]
        num_to_convert = line[pos_a + 1: pos_a + 5]
        right_side = line[pos_a + 5:]
        # String formatting to preserve padding as per original file
        newline = '{}{:>4}{}'.format(left_side,
                                     replacements[int(num_to_convert)],
                                     right_side)
        fout.write(newline)
If there's a possibility that one of the values in the column will not be in your replacements dict, and you want to keep that value unchanged, then instead of replacements[int(num_to_convert)], use replacements.get(int(num_to_convert), num_to_convert).
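If the column position is not guaranteed, a regular expression anchored on the two trailing numbers is one alternative to fixed offsets. This is a sketch under the assumption that every line ends with the number to replace followed by exactly two more numbers; the names FIELD and replace_after_a are hypothetical, and string keys are used so that unknown values pass through unchanged.

```python
import re

replacements = {"2": "0", "3": "5", "15": "7", "25": "30", "115": "120", "1215": "1220"}

# 'A', optional spaces, the number to replace, then two more numbers up to end of line.
FIELD = re.compile(r'(A\s*)(\d+)(?=\s+\d+\s+\d+\s*$)')

def replace_after_a(line):
    """Swap only the number that directly follows the 'A' column."""
    return FIELD.sub(
        lambda m: m.group(1) + replacements.get(m.group(2), m.group(2)),
        line,
    )

print(replace_after_a("Name 4 Q JACK A 25 56 25"))   # Name 4 Q JACK A 30 56 25
print(replace_after_a("Name 3 P SEN A1215 45 56"))   # Name 3 P SEN A1220 45 56
```

The spacing between 'A' and the number is kept as found, but the field width is not re-padded, so a replacement with a different number of digits shifts the columns after it.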
Regex101
^[\w\d\s]{23}([\d\s]{1,4}).*$
Debuggex Demo
Note: This is more of a fixed length parsing
Python
import re
replacements = {2:0,3:5,15:7,25:30,115:120,1215:1220}
# The sample line must keep the file's fixed-width spacing: 23 characters before the 4-character number field
searchString = "Name 1 N ASHA A 2 35 23 "
replace_search = re.search(r'^[\w\d\s]{23}([\d\s]{1,4}).*$', searchString, re.IGNORECASE)
if replace_search:
    result = replace_search.group(1)
    convert_result = int(result)
    dictionary_lookup = int(replacements[convert_result])
    replace_result = '% 4d' % dictionary_lookup
    regex_replace = r"\g<1>" + replace_result + r"\g<3>"
    line = re.sub(r"^([\w\d\s]{23})([\d\s]{1,4})(.*)$", regex_replace, searchString)
    print(line)
