Regex to parse out a part of URL using python - python

I have data as follows:
data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/
I want to find file formats such as .jpg, .gif, .png, .ico, .aspx, .html and .jpeg, and parse backwards from the extension until a "/" is found. I also want to check for several occurrences throughout the string. My output should be:
data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q
Instead of writing individual commands for each of the formats, is there a way to write everything as a single command?
Can anybody help me write these commands? I am new to regex and any help would be appreciated.

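This builds a list of (name, extension) pairs. Note that re.search only picks up the first file name in each row:
import re
results = []
for link in data['url']:
    # search() returns None when a row has no matching file name
    match = re.search(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link)
    if match:
        results.append((match.group(1), match.group(2)))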

This pattern returns the file names. I have used just one of your URLs to demonstrate; for more, you could simply append the matches to a list of results:
import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
p = r'((?:[a-z]-){3}[a-z])\.'
matches = re.findall(p, url)
print('\n'.join(matches))
Output:
e-f-g-h
a-a-a-a
This assumes that the URLs all have the general form you provided.

You might try this:
data['parse'] = re.findall(r'[^/\s]+\.[a-z]+(?=\s|$)', data['url'])
That will pick out all of the file names with their extensions, assuming data['url'] is a single string. If you want to remove the extensions, the code above returns a list which you can then process with a list comprehension and re.sub like so:
[re.sub(r'\.[a-z]+$', '', exp) for exp in data['parse']]
Use the .join function to create a string, as demonstrated in Totem's answer.
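Putting the answers together, here is a minimal sketch that handles every row and every extension listed in the question in one pattern. The names EXTENSIONS, FILE_RE and parse_names are made up for illustration, and the rows are assumed to be plain strings like the samples above.

```python
import re

# Extensions listed in the question; extend this tuple as needed.
EXTENSIONS = ("jpg", "jpeg", "gif", "png", "ico", "aspx", "html")

# Capture everything between the last "/" and a known extension.
# The lookahead requires the extension to end at whitespace or end of string.
FILE_RE = re.compile(r"/([^/\s]+)\.(?:%s)(?=\s|$)" % "|".join(EXTENSIONS))


def parse_names(row):
    """Return the space-joined file names (without extension) found in one row."""
    return " ".join(FILE_RE.findall(row))


rows = [
    "http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/",
    "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html",
    "http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico",
]
parsed = [parse_names(row) for row in rows]
print(parsed)
```

Unlike the fixed w-w-w-w patterns above, this accepts any file name, so it only relies on the list of extensions being complete.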

Related

Find and replace string with empty paragraph inside .doc and .docx word document

Sample environment:
Dictionary = {"camel":"create-para","donkey":"monkey","cat":"dog"}
cwd = os.getcwd(".")
for files in cwd
    if files.endswith(".doc") or files.endswith(".doc"):
        for Dictionary in files:
            do the changes
Two things to notice:
"create-para" in the dictionary means: remove string1 and create a new paragraph in place of string1.
In a VBA macro it looks like this:
Dictionary = {"camel":"^p","donkey":"monkey","cat":"dog"}
However, how do I do that?
For example, I want to remove the word "materials" and replace it with a paragraph break.
(Before and after screenshots not included.)
I'm not fully sure what you are trying to do here. What is for Dictionary in files: meant to do? Aren't Dictionary and files two separate variables? Also, I think your if condition should be:
if files.endswith(".doc") or files.endswith(".docx"):
If you are trying to change a doc/docx file, you can do it with python-docx (note that it works with .docx files, not the older .doc format). The documentation should be able to help you out. If you want to replace paragraphs, you can use this snippet from the library's GitHub page. If you want to add paragraphs, you can use the add_paragraph function:
document.add_paragraph('A plain paragraph having some ')
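As a rough sketch of the loop the question seems to be aiming for, the following uses only the standard library. The helper names find_word_files and replace_words are hypothetical, and mapping "camel" to "\n" is only a plain-text stand-in for a paragraph break; editing a real document would still need a library such as python-docx.

```python
import os

# Mapping from the question. "\n" stands in for the asker's "create-para"
# marker; this is an assumption for plain text, not how Word stores paragraphs.
replacements = {"camel": "\n", "donkey": "monkey", "cat": "dog"}


def find_word_files(folder="."):
    """Return the names of .doc and .docx files in folder, sorted."""
    # str.endswith accepts a tuple, so one call covers both extensions.
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith((".doc", ".docx"))
    )


def replace_words(text, mapping):
    """Apply each plain-text replacement in mapping to text, in order."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


print(find_word_files("."))
print(repr(replace_words("a camel and a cat", replacements)))
```

os.listdir(".") takes the place of os.getcwd("."), which does not accept an argument and returns a path string rather than a list of files.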

Formatting csv file to utilize zabbix sender

I have a CSV file that I need to format before I can send the data to Zabbix, but I have run into a problem with one of the CSV files I have to use.
This is an example of the part of the file that gives me trouble:
Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02
Some_String_That_ineed_01vtcrmp01[ifcontainsCell01removehere]-[1],aass
These are two lines from the file; I have already handled the other lines.
I need to check whether Cell01 is in the line:
if 'Cell01' in h: do something.
I need to remove everything between the [ ] that contains the word Cell01, including the brackets themselves, and leave only this:
Some_String_That_ineed_01,xxxxxxxxx02
Some_String_That_ineed_01vtcrmp01-[1],aass
My script already handles the other cases easily. There must be a better way than what I have in mind, which is to use h.split on the first [, split again on the comma, remove the content I don't want, and then join the remaining strings. I can't use replace because I need to keep this data ([1]).
Later on, I will pass the result to zabbix sender as the item and item key_. I already have the hosts, timestamps and values.
You should use a regular expression (the re module):
import re
s = "Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02"
# [^\]]* cannot cross a closing bracket, so a later group such as [1] is left alone
replaced = re.sub(r'\[[^\]]*Cell01[^\]]*\]', '', s)
print(replaced)
This prints:
[root#somwhere test]# python replace.py
Some_String_That_ineed_01,xxxxxxxxx02
You can also experiment here.
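To check the idea against both sample lines from the question, here is a small sketch. The names CELL01_GROUP and strip_cell01 are illustrative only. The character class excludes both bracket characters, so the match cannot spill over into the trailing [1] on the second line.

```python
import re

# One bracketed group containing "Cell01". The character class excludes
# both "[" and "]", so the match stays inside a single pair of brackets.
CELL01_GROUP = re.compile(r"\[[^\[\]]*Cell01[^\[\]]*\]")


def strip_cell01(line):
    """Remove every [...] group that contains Cell01; leave other groups alone."""
    return CELL01_GROUP.sub("", line)


lines = [
    "Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02",
    "Some_String_That_ineed_01vtcrmp01[ifcontainsCell01removehere]-[1],aass",
]
cleaned = [strip_cell01(line) for line in lines]
for line in cleaned:
    print(line)
```

Lines without Cell01 pass through unchanged, so the separate if 'Cell01' in h check becomes optional.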

Python - Regex - Match anything except

I'm trying to get my regular expression to work but can't figure out what I'm doing wrong. I am trying to find any file that is NOT in a specific format. For example all files are dates that are in this format MM-DD-YY.pdf (ex. 05-13-17.pdf). I want to be able to find any files that are not written in that format.
I can create a regex to find those with:
(\d\d-\d\d-\d\d\.pdf)
I tried using the negative lookahead so it looked like this:
(?!\d\d-\d\d-\d\d\.pdf)
That works in not finding those anymore but it doesn't find the files that are not like it.
I also tried adding a .* after the group but then that finds the whole list.
(?!\d\d-\d\d-\d\d\.pdf).*
I'm searching through a small list right now for testing:
05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf
Is there a way to accomplish what I'm looking for?
Thanks!
You can try this:
import re
s = "Test.docx 04-05-2017.docx 04-04-17.pdf secondtest.pdf"
new_data = re.findall(r"[a-zA-Z]+\.[a-zA-Z]+|\d{1,}-\d{1,}-\d{4}\.[a-zA-Z]+", s)
Output:
['Test.docx', '04-05-2017.docx', 'secondtest.pdf']
First find all the names that match, then remove them from your list separately. The firstFindtheMatching function finds the matching names using the re library:
def firstFindtheMatching(listoffiles):
    """
    :listoffiles: string of space-separated file names to check against the format
    :final_string: every name that matches the format 01-01-17.pdf (MM-DD-YY.pdf), joined into one str. listoffiles is also returned so that you can see the whole output in one place, but you won't really need that.
    """
    import re
    matchednames = re.findall(r"\d{1,2}-\d{1,2}-\d{1,2}\.pdf", listoffiles)
    # join all matches into one string for simpler handling with sets
    final_string = ' '.join(matchednames)
    return (final_string, listoffiles)
Here is the output:
('05-08-17.pdf 04-08-17.pdf 08-09-16.pdf', '05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf')
set(['08-09-2016.pdf', 'some-all-letters.pdf', 'Test.pdf'])
I've used the main below if you'd like to reproduce the results. A good thing about doing it this way is that you can add more regexes to firstFindtheMatching(); it helps you keep things separate.
def main():
    filenames = "05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf"
    matchednames, alllist = firstFindtheMatching(filenames)
    print(matchednames, alllist)
    notcommon = set(filenames.split()) - set(matchednames.split())
    print(notcommon)
if __name__ == '__main__':
    main()
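Another way to express "everything that is not MM-DD-YY.pdf" is to test each name as a whole and keep the ones that fail. The unanchored lookahead in the question succeeds at almost every position in the string, which is why adding .* matched the whole list. The sketch below uses re.fullmatch on individual names instead; DATE_NAME and non_matching are names chosen for this example, and it assumes the file names contain no spaces.

```python
import re

# MM-DD-YY.pdf, e.g. 05-13-17.pdf. fullmatch() must consume the whole name.
DATE_NAME = re.compile(r"\d\d-\d\d-\d\d\.pdf")


def non_matching(names):
    """Return the names that are not in MM-DD-YY.pdf form, in original order."""
    return [name for name in names if not DATE_NAME.fullmatch(name)]


# Test list from the question, split into individual file names.
files = "05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf".split()
odd_ones = non_matching(files)
print(odd_ones)
```

This only checks the shape of the name; it does not verify that the digits form a real date.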

How to do parsing in python?

I'm fairly new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: parse some data made of symbols that are unfamiliar to me and put it into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing, or what all these symbols below mean.
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and how the parsing should go.
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
As others suggest, take a look at some regex tutorials, and also at the re module help.
You are probably looking for something like this:
import re
# 1-based (start, end) character positions of each field in a header line
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
                 'batchDate': (22,29), 'batchTime': (30,35)}
poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)
parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    # a fresh dict per header, so the list does not hold one shared dict
    batchHeaderDict = {}
    for key in headerMapping:
        start = headerMapping[key][0]-1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)
Then you have a list of dicts, where each dict holds the data for each attribute. I assume that your data file is loaded in text, which is a string. Each dict is built for one structure found (the POA Batch Header in this example). Note that \d{30} expects 30 digits right after $$HDR; the sample line in the question has PUBID there, so the pattern needs adjusting to the real layout.
If you want to parse it further, you have to write a function to parse the data in each attribute.
def batchDate(batch):
    # assumes batch is MMDDYY; adjust the slices for other layouts such as YYYYMMDD
    return (batch[0:2]+'-'+batch[2:4]+'-20'+batch[4:])
for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, that's an example and I don't know the documentation for your data! I guess it works like that, but the rest is up to you.
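For a single line, the same fixed-width idea can be sketched without a regex at all. The positions below are copied from headerMapping in the answer and are a guess at the layout, not something taken from a format specification; parse_fixed_width and HEADER_FIELDS are names invented for this example.

```python
# 1-based inclusive (start, end) positions, copied from headerMapping above.
# They are a guess at the layout, not taken from a format specification.
HEADER_FIELDS = {
    "type": (1, 5),
    "pubid": (6, 11),
    "batchID": (12, 21),
    "batchDate": (22, 29),
    "batchTime": (30, 35),
}


def parse_fixed_width(line, fields):
    """Slice line into a dict of stripped field values using 1-based positions."""
    return {
        name: line[start - 1:end].strip()
        for name, (start, end) in fields.items()
    }


# Header line from the question's sample data.
sample = "$$HDRPUBID 112701130020011127162536"
header = parse_fixed_width(sample, HEADER_FIELDS)
print(header)
```

The detail lines (H1, D1, D2) would each need their own position table once the real field widths are known.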

Why don't I find the words in their original source list?

I am trying to find Chinese words in two different files, but it didn't work, so I tried to search for the words in the same file I got them from, and it seems it doesn't find them there either. How is that possible?
chin_split = codecs.open("CHIN_split.txt","r+",encoding="utf-8")
I used this for the regex code:
import re
for n in re.findall(ur'[\u4e00-\u9fff]+',chin_split.read()):
    print n in re.findall(ur'[\u4e00-\u9fff]+',chin_split.read())
How come I only get False printed?
FYI, I tried this and it works:
for x in [1,2,3,4,5,6,6]:
    print x in [1,2,3,4,5,6,6]
BTW, chin_split contains words in English, Hebrew and Chinese.
Some lines from chin_split.txt:
he daodan 核导弹 טיל גרעיני
hedantou 核弹头 ראש חץ גרעיני
helu 阖庐 "ביתו, מעונו
helu 阖庐 שם מלך וו בתקופת ה'אביב והסתיו'"
huiwu 会晤 להיפגש עם
You are reading the same file object several times, and that is the problem.
The first chin_split.read() returns all the content and leaves the file position at the end, so the later calls (inside the loop) just get an empty string.
The loop as written makes no sense, but if you want to keep it, save the file content in a variable first.
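A small sketch of that fix, written for Python 3 rather than the Python 2 of the question. An io.StringIO object stands in for the opened file so the example runs without CHIN_split.txt, and the sample text is taken from the lines quoted in the question; CHINESE_RUN is a name chosen here.

```python
import io
import re

CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

# io.StringIO stands in for the opened file so the sketch runs anywhere.
chin_split = io.StringIO("he daodan 核导弹\nhedantou 核弹头\nhuiwu 会晤\n")

content = chin_split.read()      # read once and keep the text
second_read = chin_split.read()  # the position is now at the end, so this is ""

words = CHINESE_RUN.findall(content)
all_found = all(word in words for word in words)
print(words, repr(second_read), all_found)
```

With a real file, calling chin_split.seek(0) before reading again would also work, but keeping the text in a variable avoids rereading it on every loop iteration.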
