Unable to do regex based operations in a gzip file in Python

I have a .gz file that contains several strings. I need to perform several regex-based operations on the data contained in the .gz file.
I get the following error when I use re.findall() on the lines of data extracted:
File "C:\Users\santoshn\AppData\Local\Continuum\anaconda3\lib\re.py", line 182, in search
return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
I have tried opening the file with mode "r", with the same result.
Do I have to decompress this file first and then do the regex operations, or is there another way to address this?
The data contains several text lines; an example line is listed below:
ThreadContext 432 mov (8) <8;1,2>r2 <8;3,3>r4 Instruction count

I was able to fix this issue by reading the file using gzip.open()
with gzip.open(file,"rb") as f:
    binFile = f.readlines()
After the file is read, each line is decoded from bytes to str using 'ascii'. After that, all regex operations such as re.search() and re.findall() work fine.
for line in binFile: # go over each line
    line = line.strip().decode('ascii')

I know this is an old question, but I stumbled on it (as well as the other HTML references in the comments) while trying to sort out the same issue. Rather than opening the gzip file as binary ("rb") and then decoding each line to ASCII, the gzip docs led me to simply open the .gz file in text mode ("rt"), which allows normal string manipulation afterwards:
with gzip.open(filepath,"rt") as f:
    data = f.readlines()
for line in data:
    split_string = date_time_pattern.split(line)
    # Whatever other string manipulation you may need.
The date_time_pattern variable is simply my compiled regex for different log date formats.
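As a minimal sketch of the text-mode approach, the snippet below builds a small sample .gz file and runs re.findall() on each decoded line. The helper name find_in_gzip, the sample file name, and the pattern are made up for illustration, and the data is assumed to be UTF-8 (or plain ASCII) text.

```python
import gzip
import os
import re
import tempfile


def find_in_gzip(path, pattern, encoding="utf-8"):
    """Return all regex matches found in a gzip-compressed text file."""
    compiled = re.compile(pattern)
    matches = []
    # Mode "rt" makes gzip.open() yield str lines, so str patterns work.
    with gzip.open(path, "rt", encoding=encoding) as f:
        for line in f:  # iterate lazily instead of calling readlines()
            matches.extend(compiled.findall(line))
    return matches


if __name__ == "__main__":
    # Hypothetical sample data modelled on the line shown in the question.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.gz")
        with gzip.open(sample, "wt", encoding="utf-8") as f:
            f.write("ThreadContext 432 mov (8) <8;1,2>r2 <8;3,3>r4 Instruction count\n")
        print(find_in_gzip(sample, r"r\d"))
```

Iterating over the file object directly avoids holding every line in memory, which may matter for large archives.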

Related

Parse Binary String from a file previously written from C++ parse it in Python

I have a file whose contents are a serialized string from a protobuf written in C++. I want to read it in Python. Here is the content of the file.
\n\013\010\202\331\242\233\006\020\200\256\263F\022$\022\024\r\255\314\355>\025\303\255&?\035<V\312=%\014H\223>\035\200\242\376>\"\007\022\005Pants\022\"\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\341G\360>\"\005\022\003Man\022\"\022\024\r\332K\335>\025a\261\252>\035\\\3534>%z\330\"?\035\324\334\354>\"\005\022\003Man\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\2778\352>\"\007\022\005Pants\022\"\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035\256\301\277>\"\005\022\003Man\022\'\022\024\r\262\271\336>\025\364\261\261>\035T\3310>%E\344\035?\035\261l\253>\"\n\022\010Clothing\022\'\022\024\rs\255\037?\025\023\207\226>\035\3347\'>%\222\3721?\035m\364\247>\"\n\022\010Clothing\022\"\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\273\336\240>\"\005\022\003Top\022$\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\036\035\226>\"\007\022\005Shirt\022\"\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035v2\225>\"\005\022\003Top\022%\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035*\204\217>\"\010\022\006Person\022&\022\024\rB\264!?\025\016r\314>\035\354t\036>%\260\373\213>\035\037/\217>\"\t\022\007T-shirt\022%\022\024\rJp\325=\025\212\260\241>\035V\361\344=%\224\036i>\035K\253\213>\"\010\022\006Person\022\'\022\024\r\307{\262>\025\332\360\"<\035\214\307\252=%\027hP>\035\033D\211>\"\n\022\010Lighting\022$\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\0352t\207>\"\007\022\005Shirt\022\'\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\0355\014\207>\"\n\022\010Lighting\022\"\022\024\r\370Q\330=\025NH\247>\035\354;\260=%\270gQ>\035\215\304\203>\"\005\022\003Man\0220\022\024\r\3123\217<\025\272\252#?\035\250\340\244>%z\257\267>\035\217\353~>\"\023\022\021Electronic 
device\022\'\022\024\r\330>\340>\025R\352\345>\035(\000%>%l\255l>\035\265\002~>\"\n\022\010Clothing\022$\022\024\r\302\3779>\025\206)\244>\035|\337\342=%\360jK>\035\244t{>\"\007\022\005Table\022%\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\342E{>\"\010\022\006Person\022\'\022\024\r \323\330=\025\334\200\262>\035d\307\267=%\250._>\035-\224u>\"\n\022\010Clothing\022\'\022\024\re\017/?\025,\240;?\035`!\032>%P\333\210>\035\221up>\"\n\022\010Clothing\022\'\022\024\r\007\315$?\025\306E%?\035\240\312\307=%\362\014\237>\035\203)d>\"\n\022\010Clothing\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\010\261a>\"\007\022\005Jeans\022$\022\024\rk\247\355>\025NT&?\035\2240\313=%\272\272\223>\035\243\227a>\"\007\022\005Jeans\022\'\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035#_`>\"\n\022\010Clothing\0220\022\024\rF\'\027?\025dn}>\035\360J\232=%l\244;>\035m%\\>\"\023\022\021Electronic device\022,\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\035F\276R>\"\017\022\rLight fixture\022(\022\024\r\247s=>\025\337a\251>\035\216\365\321=%zY\r>\035LIN>\"\013\022\tTable top
If I take the above string and assign it as a binary string, I am able to Parse it and interpret values correctly.
object_reader = objreader.ObjectDetectionPredictionResult()
orig = b'\n\013\010\202\331\242\233\006\020\200\256\263F\022$\022\024\r\255\314\355>\025\303\255&?\035<V\312=%\014H\223>\035\200\242\376>\"\007\022\005Pants\022\"\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\341G\360>\"\005\022\003Man\022\"\022\024\r\332K\335>\025a\261\252>\035\\\3534>%z\330\"?\035\324\334\354>\"\005\022\003Man\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\2778\352>\"\007\022\005Pants\022\"\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035\256\301\277>\"\005\022\003Man\022\'\022\024\r\262\271\336>\025\364\261\261>\035T\3310>%E\344\035?\035\261l\253>\"\n\022\010Clothing\022\'\022\024\rs\255\037?\025\023\207\226>\035\3347\'>%\222\3721?\035m\364\247>\"\n\022\010Clothing\022\"\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\273\336\240>\"\005\022\003Top\022$\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\036\035\226>\"\007\022\005Shirt\022\"\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035v2\225>\"\005\022\003Top\022%\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035*\204\217>\"\010\022\006Person\022&\022\024\rB\264!?\025\016r\314>\035\354t\036>%\260\373\213>\035\037/\217>\"\t\022\007T-shirt\022%\022\024\rJp\325=\025\212\260\241>\035V\361\344=%\224\036i>\035K\253\213>\"\010\022\006Person\022\'\022\024\r\307{\262>\025\332\360\"<\035\214\307\252=%\027hP>\035\033D\211>\"\n\022\010Lighting\022$\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\0352t\207>\"\007\022\005Shirt\022\'\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\0355\014\207>\"\n\022\010Lighting\022\"\022\024\r\370Q\330=\025NH\247>\035\354;\260=%\270gQ>\035\215\304\203>\"\005\022\003Man\0220\022\024\r\3123\217<\025\272\252#?\035\250\340\244>%z\257\267>\035\217\353~>\"\023\022\021Electronic 
device\022\'\022\024\r\330>\340>\025R\352\345>\035(\000%>%l\255l>\035\265\002~>\"\n\022\010Clothing\022$\022\024\r\302\3779>\025\206)\244>\035|\337\342=%\360jK>\035\244t{>\"\007\022\005Table\022%\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\342E{>\"\010\022\006Person\022\'\022\024\r \323\330=\025\334\200\262>\035d\307\267=%\250._>\035-\224u>\"\n\022\010Clothing\022\'\022\024\re\017/?\025,\240;?\035`!\032>%P\333\210>\035\221up>\"\n\022\010Clothing\022\'\022\024\r\007\315$?\025\306E%?\035\240\312\307=%\362\014\237>\035\203)d>\"\n\022\010Clothing\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\010\261a>\"\007\022\005Jeans\022$\022\024\rk\247\355>\025NT&?\035\2240\313=%\272\272\223>\035\243\227a>\"\007\022\005Jeans\022\'\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035#_`>\"\n\022\010Clothing\0220\022\024\rF\'\027?\025dn}>\035\360J\232=%l\244;>\035m%\\>\"\023\022\021Electronic device\022,\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\035F\276R>\"\017\022\rLight fixture\022(\022\024\r\247s=>\025\337a\251>\035\216\365\321=%zY\r>\035LIN>\"\013\022\tTable top'
object_reader.ParseFromString(orig)
print(object_reader)
If the above string is in a file and I try to read it as a binary file and parse it, I get an error.
with open("./payload", "rb") as fd:
    val = fd.read()
object_reader.ParseFromString(val)
print(object_reader)
Error:
google.protobuf.message.DecodeError: Error parsing message
Looking further, it seems that reading the file in binary mode adds extra '\' escape characters, as shown below.
b'\\n\\013\\010\\202\\331\\242\\233\\006\\020\\200\\256\\263F\\022$\\022\\024\\r\\255\\314\\355>\\025\\303\\255&?\\035<V\\312=%\\014H\\223>\\035\\200\\242\\376>\\"\\007\\022\\005Pants\\022\\"\\022\\024\\rmv\\036?\\025|\\032\\224>\\0354z->%\\334\\3354?\\035\\341G\\360>\\"\\005\\022\\003Man\\022\\"\\022\\024\\r\\332K\\335>\\025a\\261\\252>\\035\\\\\\3534>%z\\330\\"?\\035\\324\\334\\354>\\"\\005\\022\\003Man\\022$\\022\\024\\r-\\202%?\\025*\\331$?\\035\\300\\240\\310=%\\364\\r\\236>\\035\\2778\\352>\\"\\007\\022\\005Pants\\022\\"\\022\\024\\r^\\350-?\\025\\246\\3034?\\035\\240\\250E>%h~\\226>\\035\\256\\301\\277>\\"\\005\\022\\003Man\\022\\\'\\022\\024\\r\\262\\271\\336>\\025\\364\\261\\261>\\035T\\3310>%E\\344\\035?\\035\\261l\\253>\\"\\n\\022\\010Clothing\\022\\\'\\022\\024\\rs\\255\\037?\\025\\023\\207\\226>\\035\\3347\\\'>%\\222\\3721?\\035m\\364\\247>\\"\\n\\022\\010Clothing\\022\\"\\022\\024\\r0\\341\\337>\\025\\224\\032\\346>\\035\\354q%>%D%m>\\035\\273\\336\\240>\\"\\005\\022\\003Top\\022$\\022\\024\\r0\\341\\337>\\025\\224\\032\\346>\\035\\354q%>%D%m>\\035\\036\\035\\226>\\"\\007\\022\\005Shirt\\022\\"\\022\\024\\r\\245G!?\\025e\\340\\314>\\0354\\222 >%;\\003\\213>\\035v2\\225>\\"\\005\\022\\003Top\\022%\\022\\024\\r^\\350-?\\025\\246\\3034?\\035\\240\\250E>%h~\\226>\\035*\\204\\217>\\"\\010\\022\\006Person\\022&\\022\\024\\rB\\264!?\\025\\016r\\314>\\035\\354t\\036>%\\260\\373\\213>\\035\\037/\\217>\\"\\t\\022\\007T-shirt\\022%\\022\\024\\rJp\\325=\\025\\212\\260\\241>\\035V\\361\\344=%\\224\\036i>\\035K\\253\\213>\\"\\010\\022\\006Person\\022\\\'\\022\\024\\r\\307{\\262>\\025\\332\\360\\"<\\035\\214\\307\\252=%\\027hP>\\035\\033D\\211>\\"\\n\\022\\010Lighting\\022$\\022\\024\\r\\245G!?\\025e\\340\\314>\\0354\\222 
>%;\\003\\213>\\0352t\\207>\\"\\007\\022\\005Shirt\\022\\\'\\022\\024\\r\\222\\034\\207<\\025\\037`\\266>\\035D\\351\\254=%\\262BN>\\0355\\014\\207>\\"\\n\\022\\010Lighting\\022\\"\\022\\024\\r\\370Q\\330=\\025NH\\247>\\035\\354;\\260=%\\270gQ>\\035\\215\\304\\203>\\"\\005\\022\\003Man\\0220\\022\\024\\r\\3123\\217<\\025\\272\\252#?\\035\\250\\340\\244>%z\\257\\267>\\035\\217\\353~>\\"\\023\\022\\021Electronic device\\022\\\'\\022\\024\\r\\330>\\340>\\025R\\352\\345>\\035(\\000%>%l\\255l>\\035\\265\\002~>\\"\\n\\022\\010Clothing\\022$\\022\\024\\r\\302\\3779>\\025\\206)\\244>\\035|\\337\\342=%\\360jK>\\035\\244t{>\\"\\007\\022\\005Table\\022%\\022\\024\\rmv\\036?\\025|\\032\\224>\\0354z->%\\334\\3354?\\035\\342E{>\\"\\010\\022\\006Person\\022\\\'\\022\\024\\r \\323\\330=\\025\\334\\200\\262>\\035d\\307\\267=%\\250._>\\035-\\224u>\\"\\n\\022\\010Clothing\\022\\\'\\022\\024\\re\\017/?\\025,\\240;?\\035`!\\032>%P\\333\\210>\\035\\221up>\\"\\n\\022\\010Clothing\\022\\\'\\022\\024\\r\\007\\315$?\\025\\306E%?\\035\\240\\312\\307=%\\362\\014\\237>\\035\\203)d>\\"\\n\\022\\010Clothing\\022$\\022\\024\\r-\\202%?\\025*\\331$?\\035\\300\\240\\310=%\\364\\r\\236>\\035\\010\\261a>\\"\\007\\022\\005Jeans\\022$\\022\\024\\rk\\247\\355>\\025NT&?\\035\\2240\\313=%\\272\\272\\223>\\035\\243\\227a>\\"\\007\\022\\005Jeans\\022\\\'\\022\\024\\r\\245G!?\\025e\\340\\314>\\0354\\222 >%;\\003\\213>\\035#_`>\\"\\n\\022\\010Clothing\\0220\\022\\024\\rF\\\'\\027?\\025dn}>\\035\\360J\\232=%l\\244;>\\035m%\\\\>\\"\\023\\022\\021Electronic device\\022,\\022\\024\\r\\222\\034\\207<\\025\\037`\\266>\\035D\\351\\254=%\\262BN>\\035F\\276R>\\"\\017\\022\\rLight fixture\\022(\\022\\024\\r\\247s=>\\025\\337a\\251>\\035\\216\\365\\321=%zY\\r>\\035LIN>\\"\\013\\022\\tTable top:'
I am looking for some help in reading the above file while escaping the extra escape characters. Or any other ways of reading file and loading into protobuf to parse the values.
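One possible reading of the output above is that the file does not hold the raw serialized bytes but their escaped text form (a literal backslash followed by octal digits), so nothing is added on read. Under that assumption, a minimal sketch for turning the escaped text back into bytes is shown below. The helper name unescape_payload is hypothetical, and the approach relies on every escape in the file being one that Python's unicode_escape codec understands.

```python
def unescape_payload(raw: bytes) -> bytes:
    """Convert C-style escaped text (a backslash plus octal digits, etc.) to raw bytes."""
    # unicode_escape interprets \n, \r, \t, \", \', \\ and octal escapes,
    # producing code points in the range 0-255 for this kind of input;
    # latin-1 then maps each code point back to a single byte.
    return raw.decode("unicode_escape").encode("latin-1")


if __name__ == "__main__":
    # A short made-up fragment in the same style as the payload file.
    escaped = rb'\n\013\010\202\331Pants\"\007'
    print(unescape_payload(escaped))
    # With the names from the question, the call might look like:
    #   object_reader.ParseFromString(unescape_payload(val))
```

A trailing newline or any other stray character in the file would end up in the result and could still break parsing, so the input may need trimming first. If the C++ side can be changed, writing the serialized bytes to the file in binary mode avoids the escaping step altogether.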

Read contents from zipfile, apply transformation and write to new zip file in Python

I have a zip file which contains a text file(with millions of lines). I need to read line by line, apply some transformations to each line and write to a new file and zip it.
with zipfile.ZipFile("orginal.zip") as zf, zipfile.ZipFile("new.zip", "w") as new_zip:
    with io.TextIOWrapper(zf.open("orginal_file.txt"), encoding="UTF-8") as fp, open("new.txt", "w") as new_txt:
        for line in fp:
            new_txt.write(f"{line} - NEW") # Some transformation
    new_zip.writestr("new.txt", new_txt)
But I am getting following error in new_zip.writestr("new.txt", new_txt)
TypeError: object of type '_io.TextIOWrapper' has no len()
If I do transformation using the above method, will there be any out of memory issue(since the file can have millions of lines)?
How to identify the first line(since the first line is a header record)?
When I write using new_txt.write(f"{line} - NEW"), - NEW comes first in the line(For ex. if line is 003000000011000000, the output will be - NEW003000000011000000).
How can we ensure the file integrity(for ex. to ensure whether all lines are written in the new file.)
What causes the TypeError: object of type '_io.TextIOWrapper' has no len() error?
Thank You.
When you're doing:
new_zip.writestr("new.txt", new_txt)
you are trying to write the object new_txt as some data (text or equivalent) to the zip file as the file "new.txt". But the object new_txt is already a file. That's what gives you the error: TypeError: object of type '_io.TextIOWrapper' has no len() - it's expecting some content, but getting a file object.
From the docs:
Write a file into the archive. The contents is data, which may be either a str or a bytes instance;
Instead, what you probably want to do is use write(file):
new_zip.write("new.txt")
which should write the file "new.txt" into the zip file.
Regarding your other questions:
If I do transformation using the above method, will there be any out of memory issue(since the file can have millions of lines)?
Probably not. Both files are processed as streams, one line at a time, so memory use should not grow with the size of the file.
How to identify the first line(since the first line is a header record)?
Use a flag that gets set in the first iteration of the line loop
When I write using new_txt.write(f"{line} - NEW"), - NEW comes first in the line(For ex. if line is 003000000011000000, the output will be - NEW003000000011000000).
You are probably missing a newline \n from your transformation logic. Each line read from the file still ends with its own newline, so the suffix lands after that newline and shows up at the start of the next line. Strip the trailing newline from the input line and add a \n after the suffix.
How can we ensure the file integrity(for ex. to ensure whether all lines are written in the new file.)
Count the lines? Ideally, unless some error occurs all lines should be read without you having to worry about it.
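Putting the points above together, here is a minimal sketch that streams one text member from the original archive straight into a new archive, leaves the header line unchanged, appends the suffix after stripping the newline, and returns a line count for a basic integrity check. The function name transform_zip and its parameters are made up for illustration, and it assumes Python 3.6 or later, where ZipFile.open() supports mode "w".

```python
import io
import zipfile


def transform_zip(src_zip, src_name, dst_zip, dst_name, suffix=" - NEW"):
    """Copy one text member into a new archive, appending suffix to data lines.

    Returns the number of lines written so the caller can compare counts.
    """
    written = 0
    with zipfile.ZipFile(src_zip) as zf, \
            zipfile.ZipFile(dst_zip, "w", compression=zipfile.ZIP_DEFLATED) as new_zip:
        # Wrap both binary member streams so we read and write text line by line.
        with io.TextIOWrapper(zf.open(src_name), encoding="utf-8") as src, \
                io.TextIOWrapper(new_zip.open(dst_name, "w"),
                                 encoding="utf-8", newline="\n") as dst:
            for index, line in enumerate(src):
                text = line.rstrip("\r\n")  # drop the newline before appending
                if index == 0:
                    dst.write(text + "\n")  # header record, copied unchanged
                else:
                    dst.write(f"{text}{suffix}\n")
                written += 1
    return written
```

Writing directly into the new archive avoids the intermediate new.txt file. For members that may grow beyond 2 GiB, ZipFile.open() also accepts force_zip64=True.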

Python - doc to docx file converter input, file path from a txt file

Hi stackoverflow community,
Situation,
I'm trying to run this converter found from here,
However, what I want is for it to read a list of file paths from a text file and convert them.
The reason is that these file paths are filtered manually, so I don't have to convert unnecessary files. There is a large number of unnecessary files in the folder.
How can I go about this? Thank you.
with open("file_path",'r') as file_content:
    content=file_content.read()
content=content.split('\n')
You can read the file's contents using the method above, then convert them into a list (or any other iterable type) so that they can be used in a for loop. I used content=content.split('\n') to split the contents on '\n' (a newline character '\n' is inserted every time you press the Enter key); you can split on any other character as well.
for i in content:
    # the code you want to execute
Looking at your situation, I guess this is what you want (to convert only certain files in a directory), in which case you don't need an extra '.txt' file:
import os
for f in os.listdir(path):  # path is the directory to scan
    if f.startswith("Prelim") and f.endswith(".doc"):
        convert(os.path.join(path, f))  # listdir() returns bare file names
But if for some reason you want to stick with the ".txt" processing, this may help:
with open("list.txt") as f:
    lines = f.readlines()
for line in lines:
    convert(line.strip())  # strip the trailing newline from each path
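A slightly fuller sketch of the list-file approach follows. It skips blank lines, strips whitespace from each path, and keeps going when one conversion fails. The names read_paths and convert_all are made up for illustration, and convert stands in for whatever converter function you already have; its exact signature is an assumption.

```python
def read_paths(list_file):
    """Return the non-empty, whitespace-stripped lines of list_file."""
    with open(list_file, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def convert_all(list_file, convert):
    """Call convert(path) for each listed path; return the failures."""
    failed = []
    for path in read_paths(list_file):
        try:
            convert(path)  # hypothetical converter taking a single path
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((path, exc))
    return failed
```

Returning the failures instead of raising lets you review the problem files once the batch has finished.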

Iterate through file but ignore certain line break characters?

I know that I can read the entire file into memory and simply replace the offending character in memory then iterate through the stored file, but I don't want to do that because these are MASSIVE text files (often exceeding 4GB).
With that said, I want to iterate line by line through a file (which has been properly encoded as utf-8 using codecs) but I don't want line breaks to occur on the \x0b (\v) character. Unfortunately, there is some binary data that shows up in my file that has the \x0b character. Naturally, this causes a line break which ends up splitting up some lines that I need to keep intact. I'd like to ignore this character when determining where line breaks should occur while iterating through the file.
Is there a parameter or approach that will enable me to do this? I'm ok with writing my own generator to iterate line by line through the file by specifying my own valid line break characters, but I'm not sure if there isn't a simpler approach, and I'm not sure how to do this since I'm using the codecs library to handle encoding.
Here are some (sanitized) sample data:
Record#|EventID|Date| Time-UTC|Level|computer name|param_01|param_02|param_03|param_04|param_05|param_06|source name|event log
84491|682|03/19/2015| 21:59:16.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0xF38058)|RDP-Tcp#12|RogueApp|10.3.98.6|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90582|682|04/03/2015| 14:42:14.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#5|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90613|682|04/03/2015| 16:26:03.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºà¨€A਀Aì°†éªá… ê±ºà¬€A଀Aé¶é«á… Ö Î„|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90626|682|04/03/2015| 16:57:35.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#11|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91018|682|04/04/2015| 13:56:13.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#33|Anonymous|10.3.58.13|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91038|682|04/04/2015| 14:09:19.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#39|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ¸€x渀xì°†éªá… ê±ºæ¬€x欀xé¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91064|682|04/04/2015| 15:25:33.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x11FA916)|RDP-Tcp#43|CONTROLLER|10.3.58.4|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91163|682|04/04/2015| 16:40:19.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#2|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºá´€æ®–ᴀ殖찆éªá… ê±ºã¬€æ®–㬀殖é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91204|682|04/04/2015| 18:10:55.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#5|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ˜€æ˜€ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91545|682|04/05/2015| 13:41:58.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#7|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºìˆ€ìˆ€ì°†éªá… ê±ºëŒ€ëŒ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91567|682|04/05/2015| 14:42:21.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºæ €æ €ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
92120|682|04/06/2015| 19:06:43.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x3D6DB)|RDP-Tcp#2|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºç„€ç„€ì°†éªá… ê±ºçœ€çœ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
It parses everything fine except for the very last row. Yes I know there shouldn't be binary data in a CSV file, but there is. And I have no choice in that matter.
>>> with open("out.test","wb") as f:
...     f.write("a\va\nb\rq")
...
>>> for line in open("out.test","rb"):
...     print line.decode("utf8")
...
a♂a
q
This seems to work fine in Python 2.7. What encoding is the file in that makes this fail?
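For Python 3, a minimal sketch that avoids splitting on \x0b is shown below. It assumes the unwanted breaks come from the codecs line reader, which relies on str.splitlines() and therefore treats \x0b as a line boundary; the built-in open() with newline="\n" ends lines only at "\n". The generator name iter_lines is made up for illustration.

```python
def iter_lines(path, encoding="utf-8"):
    """Yield lines that end only at LF, keeping vertical tabs inside the line."""
    # newline="\n" disables universal-newline handling, so only "\n" ends a line.
    # errors="replace" keeps stray binary bytes from raising UnicodeDecodeError.
    with open(path, "r", encoding=encoding, newline="\n", errors="replace") as f:
        for line in f:  # reads lazily, so multi-gigabyte files are fine
            yield line.rstrip("\n")
```

Because the file object is iterated lazily, memory use stays flat regardless of file size. Rows that end in "\r\n" would keep the "\r" with this setting, so that case may need an extra rstrip("\r").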

Reading Regular Expressions from a text file

I'm currently trying to write a function that takes two inputs:
1 - The URL for a web page
2 - The name of a text file containing some regular expressions
My function should read the text file line by line (each line being a different regex) and then execute the given regex on the web page source code. However, I've run into trouble doing this:
example
Suppose I want the address contained on a Yelp page with URL = http://www.yelp.com/biz/liberty-grill-cork
where the regex is \<address\>\s*([^<]*)\\b\s*<. In Python, I then run:
address = re.search('\<address\>\s*([^<]*)\\b\s*<', web_page_source_code)
The above works. However, if I write the regex in a text file as is and then read it from the text file, it doesn't work. So reading the regex from a text file is what causes the problem. How can I rectify this?
EDIT: This is how I'm reading the regexes from the text file:
with open("test_file.txt","r") as file:
    for regex in file:
        address = re.search(regex, web_page_source_code)
Just to add, the reason I want to read regexes from a text file is so that my function code can stay the same and I can alter my list of regexes easily. If anyone can suggest any other alternatives that would be great.
Your string has some backslashes and other characters escaped to avoid special meaning in a Python string literal, not only for the regex itself.
You can easily verify what happens by printing the string you load from the file. If the backslashes are doubled in the printed output, the file contents are wrong.
The text you want in the file is:
File
\<address\>\s*([^<]*)\b\s*<
Here's how you can check it
In [1]: a = open('testfile.txt')
In [2]: line = a.readline()
-- this is the line as you'd see it in python code when properly escaped
In [3]: line
Out[3]: '\\<address\\>\\s*([^<]*)\\b\\s*<\n'
-- this is what it actually means (what re will use)
In [4]: print(line)
\<address\>\s*([^<]*)\b\s*<
OK, I managed to get it working. For anyone who wants to read regular expressions from text files, you need to do the following:
Ensure that regex in the text file is entered in the right format (thanks to MightyPork for pointing that out)
You also need to remove the newline '\n' character at the end
So overall, your code should look something like:
import re

with open("test_file.txt","r") as a:
    line = a.readline()
line = line.strip('\n')  # the trailing newline is not part of the regex
result = re.search(line,page_source_code)
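To handle a file with several regexes, the two steps above can be wrapped up as in the sketch below. The function names load_patterns and first_matches are made up for illustration, and the file is assumed to hold one regex per line, written exactly as the re module should see it (single backslashes).

```python
import re


def load_patterns(path):
    """Compile one regex per non-empty line of the file at path."""
    patterns = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            text = line.rstrip("\n")  # the trailing newline is not part of the regex
            if text:
                patterns.append(re.compile(text))
    return patterns


def first_matches(patterns, source):
    """Return the first match object (or None) for each compiled pattern."""
    return [p.search(source) for p in patterns]
```

Compiling once up front means a malformed regex in the file raises re.error when the file is loaded, not midway through scanning a page.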
