How to use a regular expression to retrieve specific text in Python

I'm trying to retrieve all male members from a name list that looks something like this: A B(male) C D E(male) F(male) G
All strings are separated by spaces. The name list is saved as a text file: name.txt
I would like Python to read in name.txt, retrieve all males from the list, and print them out (in this case B, E and F).
How do I use a regular expression to achieve that? Thanks!

I am just giving the regular expression: regex = r"(\w+)\(male\)"
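A minimal sketch of how that pattern could be applied, assuming the file holds the space-separated list shown in the question. The helper names find_males and find_males_in_file are made up for illustration; the sample string stands in for the contents of name.txt.

```python
import re

# Pattern from the answer: a run of word characters directly followed by "(male)".
MALE_PATTERN = re.compile(r"(\w+)\(male\)")


def find_males(text):
    """Return the names tagged with (male), in order of appearance."""
    # With a single capture group, findall returns just the captured names.
    return MALE_PATTERN.findall(text)


def find_males_in_file(path):
    """Read the whole file (e.g. name.txt) and extract the tagged names."""
    with open(path, encoding="utf-8") as handle:
        return find_males(handle.read())


# Sample data copied from the question, used here instead of the real file.
sample = "A B(male) C D E(male) F(male) G"
males = find_males(sample)
print(" ".join(males))  # B E F
```

Calling find_males_in_file("name.txt") should give the same list when the file contains the sample line.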

It's apparently some data. Why are you storing and retrieving it from a text file?
If it's temporary data being stored in a text file, maybe change the formatting: specify both 'Male' and 'Female', and put one entry per line so you can loop through the file.
That'll be more systematic.
Then all you have to do is look for a string match for 'Male' in every line and select that line to print.
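As a rough sketch of that suggestion, assume a hypothetical layout of one "name,gender" entry per line. Neither this layout nor the genders given to the untagged names come from the question; they are invented for illustration. Comparing the gender field exactly is a little safer than a substring search, since a case-insensitive search for "male" would also hit "Female".

```python
def males_from_lines(lines):
    """Collect names whose gender field is exactly 'male' (case-insensitive)."""
    males = []
    for line in lines:
        # Assumed layout: "name,gender". partition tolerates a missing comma.
        name, _, gender = line.strip().partition(",")
        if gender.strip().lower() == "male":
            males.append(name.strip())
    return males


# Hypothetical records; in practice these would come from iterating over the file.
records = ["A,Female", "B,Male", "C,Female", "E,Male"]
males = males_from_lines(records)
print(males)  # ['B', 'E']
```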

Related

Rewrite a specific portion of a text file in python

(1) I am using Python and would like to create a function that rewrites a portion of a text file. Referencing the sample example below, I would like to be able to delete everything from [Variables] onwards and write new content from that position. I can't figure out how to achieve this using any of seek(), truncate() and/or tell().
I'm thinking I may have to read and store the file's contents up to [Variables] and write that back in before appending the new content. Is there a better way to go about this?
(2) Bonus question: How would I do this if there was content beyond the variables section that I wanted to remain unchanged? This is currently not required, but it would be helpful to know for the future.
Sample Text File:
"[Log]
This happened
That happened
etc
[Variables]
Animals: [Dog, Cat]
Number: 4"
You can try using a regular expression:
import re
string = text  # text holds the contents of the file
word = '[Variables]'
# Match '[Variables]' and all characters after it. re.escape() keeps the
# square brackets from being read as a character class.
pattern = re.escape(word) + ".*"
# re.DOTALL lets '.' match newlines, so the match runs to the end of the string
string = re.sub(pattern, '', string, flags=re.DOTALL)
print(string)
Here, if text is the sample shown in your question, the output of the code will be:
"[Log]
This happened
That happened
etc"
In order to add new text at the end you just need to concatenate a new string to the existing one like:
string += "Some Text"
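For the file itself, one possible sketch follows: read everything, cut at the marker, then rewrite the file in place with seek() and truncate(). The function name replace_section and the next_marker parameter are assumptions for illustration; next_marker is one way to approach the bonus question by keeping whatever follows a later section header. The demo writes to a temporary file rather than a real one.

```python
import os
import tempfile


def replace_section(path, marker, new_content, next_marker=None):
    """Replace everything from marker onwards (or up to next_marker) with new_content."""
    with open(path, "r+", encoding="utf-8") as handle:
        text = handle.read()
        start = text.find(marker)
        if start == -1:
            raise ValueError(f"{marker!r} not found in {path}")
        tail = ""
        if next_marker is not None:
            # Keep the content that begins at the next section header, if present.
            end = text.find(next_marker, start + len(marker))
            if end != -1:
                tail = text[end:]
        handle.seek(0)      # go back to the start of the file
        handle.write(text[:start] + new_content + tail)
        handle.truncate()   # drop leftover bytes if the new text is shorter


sample = (
    "[Log]\nThis happened\nThat happened\netc\n"
    "[Variables]\nAnimals: [Dog, Cat]\nNumber: 4\n"
)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    replace_section(path, "[Variables]", "[Variables]\nAnimals: [Bird]\nNumber: 1\n")
    with open(path, encoding="utf-8") as handle:
        result = handle.read()
print(result)
```

Reading the whole file first is fine for small files like this one; for very large files a streaming approach would be a better fit.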

Get File name from a dash delimited string in python lambda

I have a bunch of file names like this in S3:
1623130500-1623130500-Photo-verified-20210631-0-22.csv.gz
1623130500-1623130500-Add-to-cart-20210631-0-4.csv.gz
With Python code in Lambda, can I extract only Photo-verified / Add-to-cart from the above?
I need a solution that gives me this name at runtime from strings like the ones above.
I think you are asking how to extract either Photo-verified or Add-to-cart from the above strings.
You can split on - and then extract the portion you want. Basically, you don't want the first two parts or the last 3 parts, so use:
filename.split('-')[2:-3]
That will return a list with:
['Photo', 'verified']
You could then join() them together using:
'-'.join(filename.split('-')[2:-3])
This would give:
Photo-verified
On the second string, it would give:
Add-to-cart
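Wrapped up as a small helper, the same idea might look like the sketch below. The name extract_event_name is made up, and the code assumes every file name has exactly two leading and three trailing dash-separated fields, as in the two samples. Stripping a folder prefix from the key is an extra assumption, in case the S3 key contains one.

```python
def extract_event_name(key):
    """Return the middle part of a dash-delimited file name."""
    # Drop any folder prefix so dashes in the prefix do not shift the fields.
    filename = key.rsplit("/", 1)[-1]
    # Skip the first two fields and the last three, then rejoin the rest.
    return "-".join(filename.split("-")[2:-3])


names = [
    "1623130500-1623130500-Photo-verified-20210631-0-22.csv.gz",
    "1623130500-1623130500-Add-to-cart-20210631-0-4.csv.gz",
]
events = [extract_event_name(name) for name in names]
print(events)  # ['Photo-verified', 'Add-to-cart']
```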

How to check if a line contains a string in Python

I'm trying to check if a subString exists in a string using regular expression.
RE : re_string_literal = '^"[a-zA-Z0-9_ ]+"$'
The thing is, I don't want to match any substring. I'm reading data from a file:
Now one of the lines has this text:
cout<<"Hello"<<endl;
I just want to check if there's a string inside the line and if yes, store it in a list.
I have tried the re.match method but it only works if we have to match a pattern, but in this case, I just want to check if a string exists or not, if yes, store it somewhere.
re_string_lit = '^"[a-zA-Z0-9_ ]+"$'
text = 'cout<<"Hello World!"<<endl;'
re.match(re_string_lit,text)
It doesn't output anything.
In simple words,
I just want to extract everything inside ""
If you just want to extract everything inside "" then string splitting would be much simpler way of doing things.
>>> a = 'something<<"actualString">>something,else'
>>> b = a.split('"')[1]
>>> b
'actualString'
The above example only picks up the first quoted string. If a line contains more than two double quotes ("), you could iterate over the substrings returned by split (with balanced quotes, the odd-numbered items are the quoted parts), or apply a much simpler regular expression.
This worked for me:
re.search('"(.+?)"', 'cout<<"Hello"<<endl')
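The attempt in the question returns None because the pattern is anchored with ^ and $, so the whole line would have to be a quoted string, and "!" is not in the character class either. A sketch that collects every quoted string from a list of lines is shown below. The sample lines are invented apart from the first one, and escaped quotes inside a string are not handled.

```python
import re

# A double quote, then any run of non-quote characters (captured), then a double quote.
QUOTED = re.compile(r'"([^"]*)"')


def string_literals(line):
    """Return the contents of every double-quoted string in the line."""
    return QUOTED.findall(line)


lines = [
    'cout<<"Hello World!"<<endl;',
    "int x = 4;",
    'printf("%d %s", n, "done");',
]

found = []
for line in lines:
    # Lines without quoted strings contribute nothing.
    found.extend(string_literals(line))
print(found)  # ['Hello World!', '%d %s', 'done']
```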

replacing a character in a line of string

I have a .txt file with 20 lines. Each line carries 10 zeroes separated by commas.
0,0,0,0,0,0,0,0,0,0
I want to replace every fifth 0 with 1 in each line. I tried .replace function but I know there must be some easy way in python
You can split the text string with the command below.
text_string = "0,0,0,0,0,0,0,0,0,0"
string_list = text_string.split(",")
Then you can replace every fifth element of string_list by assigning to its index (insert would add new elements instead of replacing the existing ones).
for i in range(4, len(string_list), 5):
    string_list[i] = "1"
After this, join the elements of the list using the join method:
output = ",".join(string_list)
The output for this will be:
'0,0,0,0,1,0,0,0,0,1'
This is one way of doing it.
If the text in this file follows some rule, you can parse it as a CSV file, change every fifth index, and write the result to a new file.
But if you want to modify the existing text file in place, for example to replace a character, you can use seek; refer to "How to modify a text file?".
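A sketch of the CSV route, using the standard csv module. In-memory StringIO objects stand in for the real input and output files here, and the helper name set_every_fifth is made up for illustration.

```python
import csv
import io


def set_every_fifth(row, value="1"):
    """Return a copy of row with every fifth field replaced by value."""
    updated = list(row)
    # Indexes 4, 9, 14, ... are the 5th, 10th, 15th, ... fields.
    for index in range(4, len(updated), 5):
        updated[index] = value
    return updated


# Stand-ins for the input file and the new output file (three lines for brevity).
source = io.StringIO("0,0,0,0,0,0,0,0,0,0\n" * 3)
target = io.StringIO()

writer = csv.writer(target, lineterminator="\n")
for row in csv.reader(source):
    writer.writerow(set_every_fifth(row))

result = target.getvalue()
print(result)
```

With real files, the two StringIO objects would be replaced by open() calls, with newline="" passed as the csv module documentation recommends.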

Regex filter containing word at beginning but not containing another word

suppose i have the following string
GPH_EPL_GK_FIN
I want a regex to use in Python that searches a CSV file (not relevant to this question) for records that start with GPH but DON'T contain EPL
I know the caret ^ is used for matching at the beginning
so I have something like this
^GPH_.*
i want to include the NOT contain part as well, how do i chain the regex?
i.e.
(^GPH_.*)(?!EPL)
i would like to take this a step further eventually and any records that are returned without EPL, i.e.
GPH_ABC_JKL_OPQ
to include AFTER GPH_ the EPL part
i.e. desired result
GPH_EPL_ABC_JKL_OPQ
To cover both requirements:
1. compose a pattern that matches lines that start with GPH but do not contain EPL
2. insert the EPL_ part into the matched line at a particular position
import re
# sample string containing lines
s = '''GPH_EPL_GK_FIN
GPH_ABC_JKL_OPQ'''
pat = re.compile(r'^(GPH_)(?!.*EPL.*)')
for line in s.splitlines():
    print(pat.sub('\\1EPL_', line))
The output:
GPH_EPL_GK_FIN
GPH_EPL_ABC_JKL_OPQ
This here would do, I think:
^GPH_(?!EPL).*
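This matches any string that starts with GPH_ and does not have EPL immediately after GPH_. A string with EPL further along, such as GPH_ABC_EPL_OPQ, still matches; to reject EPL anywhere after the prefix, use ^GPH_(?!.*EPL).*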
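To see how the position of the lookahead changes the result, here is a small sketch comparing this pattern with a variant whose lookahead scans the rest of the string. The record list is made up for illustration, extending the two strings from the question.

```python
import re

# Rejects EPL only when it comes directly after the GPH_ prefix.
immediately_after = re.compile(r"^GPH_(?!EPL).*")
# Rejects EPL anywhere after the GPH_ prefix.
anywhere = re.compile(r"^GPH_(?!.*EPL).*")

records = ["GPH_EPL_GK_FIN", "GPH_ABC_JKL_OPQ", "GPH_ABC_EPL_OPQ", "XYZ_ABC"]

kept_immediate = [record for record in records if immediately_after.match(record)]
kept_anywhere = [record for record in records if anywhere.match(record)]

print(kept_immediate)  # ['GPH_ABC_JKL_OPQ', 'GPH_ABC_EPL_OPQ']
print(kept_anywhere)   # ['GPH_ABC_JKL_OPQ']
```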
I'm just guessing that one option would be,
(?<=^GPH_(?!EPL))
and re.sub with,
EPL_
Test
import re
print(re.sub(r"(?<=^GPH_(?!EPL))", "EPL_", "GPH_ABC_JKL_OPQ"))
Output
GPH_EPL_ABC_JKL_OPQ
Simply use this:
https://regex101.com/r/GwBsg2/2
pattern: ^(?!^(?:[^_\n]+_)*EPL_?(?:[^_\n]+_?)*)(.*)GPH
substitute: \1GPH_EPL
flags: gm
