I am running a command line tool that returns a coloured output (similar to ls --color). I run this tool via subprocess:
process = subprocess.run(['ls --color'], shell=True, stdout=subprocess.PIPE)
process.stdout.decode()
But the result, of course, contains colour escape sequences such as \x1b[m\x1b[m, which makes further processing of the output impossible.
How can I remove the colouring and use pure text?
This works on my Windows 10 machine with Python 3.11; your command runs without any issue.
I can't see colour in my IDE, but this command also works: subprocess.run(["ls", "--color=none"], shell=True, stdout=subprocess.PIPE).
On my machine the valid arguments are:
'never', 'no', 'none'
'auto', 'tty', 'if-tty'
'always', 'yes', 'force'
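If the tool in question has no switch for turning colour off, the escape sequences can also be removed after the fact. The following is a minimal sketch, assuming the output only contains ANSI CSI sequences (the \x1b[...m style shown in the question); strip_ansi and run_plain are hypothetical helper names, and the command is passed as a list without shell=True so that every argument reaches the program on any platform.

```python
import re
import subprocess

# Matches ANSI CSI escape sequences such as "\x1b[m" or "\x1b[01;34m":
# ESC, "[", optional parameter bytes, optional intermediate bytes, final byte.
ANSI_CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text):
    """Return text with ANSI CSI escape sequences removed."""
    return ANSI_CSI_RE.sub("", text)


def run_plain(args):
    """Run a command given as a list of arguments and return its
    decoded stdout with colour codes stripped."""
    result = subprocess.run(args, stdout=subprocess.PIPE, check=True)
    return strip_ansi(result.stdout.decode())
```

For example, run_plain(["ls", "--color=always"]) should return the plain listing even though colour was forced on.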
Code:
import subprocess
process = subprocess.run(["ls", "-la"], shell=True, stdout=subprocess.PIPE)
with open("dirList.txt", 'w') as f:
    f.write(process.stdout.decode())
Output:
total 14432
drwxr-xr-x 1 Hermann Hermann 0 Nov 19 08:00 .
drwxr-xr-x 1 Hermann Hermann 0 Aug 28 2021 ..
-rw-r--r-- 1 Hermann Hermann 0 Jan 28 09:25 .txt
-rw-r--r-- 1 Hermann Hermann 1225 Jan 6 00:51 00_Template_Program.py
-rw-r--r-- 1 Hermann Hermann 490 Jan 15 23:33 8859_1.py
-rw-r--r-- 1 Hermann Hermann 102 Jan 15 23:27 8859_1.xml
Your example worked for me. You can also open the file in 'w' mode, specify the encoding, and pass the file object directly as stdout:
import subprocess
with open('output.txt', mode='w', encoding='utf-8') as file:
    process = subprocess.run(['ls', '-l'], stdout=file)
I perform molecular dynamics wherein I process the output such that I get excel-like files as in:
-rw-r--r-- 1 staff 210030 Jan 20 16:25 summary.EKTOT
-rw-r--r-- 1 staff 210030 Jan 20 16:25 summary.EPTOT
-rw-r--r--# 1 staff 210030 Jan 20 16:25 summary.ETOT
-rw-r--r-- 1 staff 154022 Jan 20 16:25 summary.PRES
-rw-r--r--# 1 staff 105015 Jan 20 16:25 summary.DENSITY
-rw-r--r--# 1 staff 105015 Jan 20 16:25 summary.EKCMT
-rw-r--r-- 1 staff 105015 Jan 20 16:25 summary.TSOLUTE
-rw-r--r-- 1 staff 105015 Jan 20 16:25 summary.TSOLVENT
-rw-r--r-- 1 staff 105015 Jan 20 16:25 summary.VOLUME
I have downloaded these files to my computer with the goal of turning them all into csv/excel files such that the data can be entered and graphed into a jupyter notebook for record keeping. I have written the following code for this purpose (trying one file for the demo):
import pandas as pd
import os
os.chdir("Users/Downloads/analysis_BK_1")
with open('summary.DENSITY') as infile, open('summ_density.csv', 'w') as outfile:
    for line in infile:
        outfile.write(line.replace(' ', ','))
I keep getting the following error:
/Library/Frameworks/Python.framework/Versions/3.9/bin/python3 /Users/chrisgaughan/Desktop/Amber_Research_2020/excel_converter.py
(base) Chriss-MacBook-Pro:feature-engineering-for-machine-learning chris$ /Library/Frameworks/Python.framework/Versions/3.9/bin/python3 /Users/chrisgaughan/Desktop/Amber_Research_2020/excel_converter.py
Traceback (most recent call last):
File "/Users/Desktop/Amber_Research_2020/excel_converter.py", line 3, in <module>
os.chdir("Users/Downloads/analysis_BK_1")
FileNotFoundError: [Errno 2] No such file or directory: 'Users/chrisgaughan/Downloads/analysis_BK_1'
I have checked that I am in the correct directory, if that has anything to do with it.
Can someone please lend a hand?
I'm working on a project in which I've to do some serial communication with whichever device is connected (ttyS0, ttyS1 or ttyUSB0). Luckily I've come across a very useful stackoverflow link: "Simple way to query connected USB devices info in Python?". In this link there is a python code which works perfectly fine and it gives a proper device name and details.
In the example code below, "/dev/bus/usb/005/002" is the device information of "FT232 Serial (UART)". So, is there a way either to map /dev/bus/usb/005/002 to ttyS0/ttyUSB0, or to access the UART directly through that device information and do the serial communication using "/dev/bus/usb/<bus>/<device>" instead of ttyS0 or ttyUSB0?
python code:
import re
import subprocess
# Raw string so the regex escapes (\s, \d, \w) reach the re module unchanged
device_re = re.compile(r"Bus\s+(?P<bus>\d+)\s+Device\s+(?P<device>\d+).+ID\s(?P<id>\w+:\w+)\s(?P<tag>.+)$", re.I)
# check_output returns bytes on Python 3, so decode before splitting
df = subprocess.check_output("lsusb").decode()
devices = []
for i in df.split('\n'):
    if i:
        info = device_re.match(i)
        if info:
            dinfo = info.groupdict()
            dinfo['device'] = '/dev/bus/usb/%s/%s' % (dinfo.pop('bus'), dinfo.pop('device'))
            devices.append(dinfo)
print(devices)
result:
{'device': '/dev/bus/usb/001/001', 'tag': 'Linux Foundation 2.0 root hub', 'id': '1d6b:0002'}
{'device': '/dev/bus/usb/005/002', 'tag': 'Future Technology Devices International, Ltd FT232 Serial (UART) IC', 'id': '0403:6001'}
{'device': '/dev/bus/usb/005/001', 'tag': 'Linux Foundation 1.1 root hub', 'id': '1d6b:0001'}
{'device': '/dev/bus/usb/004/003', 'tag': 'Lite-On Technology Corp. ', 'id': '04ca:0061'}
{'device': '/dev/bus/usb/004/002', 'tag': 'Dell Computer Corp. ', 'id': '413c:2107'}
{'device': '/dev/bus/usb/004/001', 'tag': 'Linux Foundation 1.1 root hub', 'id': '1d6b:0001'}
{'device': '/dev/bus/usb/003/001', 'tag': 'Linux Foundation 1.1 root hub', 'id': '1d6b:0001'}
{'device': '/dev/bus/usb/002/001', 'tag': 'Linux Foundation 1.1 root hub', 'id': '1d6b:0001'}
Thanking with regards
Aatif shaikh
A useful utility is udevadm. For example, I have this USB serial adapter:
$ lsusb|grep -i prolific
Bus 001 Device 077: ID 067b:2303 Prolific Technology, Inc. PL2303 Serial Port
Running udevadm on it yields a bunch of information. Here's the start of it:
$ udevadm info -a /dev/bus/usb/001/077
Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.
looking at device '/devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4':
KERNEL=="1-3.4.4"
SUBSYSTEM=="usb"
DRIVER=="usb"
ATTR{authorized}=="1"
ATTR{avoid_reset_quirk}=="0"
ATTR{bConfigurationValue}=="1"
ATTR{bDeviceClass}=="00"
ATTR{bDeviceProtocol}=="00"
ATTR{bDeviceSubClass}=="00"
ATTR{bMaxPacketSize0}=="64"
ATTR{bMaxPower}=="100mA"
ATTR{bNumConfigurations}=="1"
ATTR{bNumInterfaces}==" 1"
ATTR{bcdDevice}=="0300"
ATTR{bmAttributes}=="a0"
ATTR{busnum}=="1"
ATTR{configuration}==""
ATTR{devnum}=="77"
ATTR{devpath}=="3.4.4"
ATTR{idProduct}=="2303"
ATTR{idVendor}=="067b"
ATTR{ltm_capable}=="no"
ATTR{manufacturer}=="Prolific Technology Inc."
ATTR{maxchild}=="0"
ATTR{product}=="USB-Serial Controller"
ATTR{quirks}=="0x0"
ATTR{removable}=="unknown"
ATTR{speed}=="12"
ATTR{urbnum}=="22"
ATTR{version}==" 2.00"
You can then look into sysfs for more details:
$ ls -l '/sys//devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4'
total 0
drwxr-xr-x 7 root root 0 Oct 11 11:03 1-3.4.4:1.0
-r--r--r-- 1 root root 4096 Oct 11 11:03 bcdDevice
-rw-r--r-- 1 root root 4096 Oct 11 11:03 bConfigurationValue
-r--r--r-- 1 root root 4096 Oct 11 11:03 bDeviceClass
-r--r--r-- 1 root root 4096 Oct 11 11:03 bDeviceProtocol
-r--r--r-- 1 root root 4096 Oct 11 11:03 bDeviceSubClass
-r--r--r-- 1 root root 4096 Oct 11 11:03 bmAttributes
-r--r--r-- 1 root root 4096 Oct 11 11:03 bMaxPacketSize0
-r--r--r-- 1 root root 4096 Oct 11 11:03 bMaxPower
-r--r--r-- 1 root root 4096 Oct 11 11:03 bNumConfigurations
-r--r--r-- 1 root root 4096 Oct 11 11:03 bNumInterfaces
-r--r--r-- 1 root root 4096 Oct 11 11:03 busnum
-r--r--r-- 1 root root 4096 Oct 11 11:03 configuration
-r--r--r-- 1 root root 65553 Oct 11 11:03 descriptors
-r--r--r-- 1 root root 4096 Oct 11 11:03 dev
-r--r--r-- 1 root root 4096 Oct 11 11:03 devnum
-r--r--r-- 1 root root 4096 Oct 11 11:03 devpath
lrwxrwxrwx 1 root root 0 Oct 11 11:03 driver -> ../../../../../../../bus/usb/drivers/usb
drwxr-xr-x 3 root root 0 Oct 11 11:03 ep_00
-r--r--r-- 1 root root 4096 Oct 11 11:03 idProduct
-r--r--r-- 1 root root 4096 Oct 11 11:03 idVendor
-r--r--r-- 1 root root 4096 Oct 11 11:03 ltm_capable
-r--r--r-- 1 root root 4096 Oct 11 11:03 manufacturer
-r--r--r-- 1 root root 4096 Oct 11 11:03 maxchild
lrwxrwxrwx 1 root root 0 Oct 11 11:03 port -> ../1-3.4:1.0/1-3.4-port4
drwxr-xr-x 2 root root 0 Oct 11 11:03 power
-r--r--r-- 1 root root 4096 Oct 11 11:03 product
-r--r--r-- 1 root root 4096 Oct 11 11:03 quirks
-r--r--r-- 1 root root 4096 Oct 11 11:03 removable
--w------- 1 root root 4096 Oct 11 11:03 remove
-r--r--r-- 1 root root 4096 Oct 11 11:03 speed
lrwxrwxrwx 1 root root 0 Oct 11 11:03 subsystem -> ../../../../../../../bus/usb
-rw-r--r-- 1 root root 4096 Oct 11 11:03 uevent
-r--r--r-- 1 root root 4096 Oct 11 11:03 urbnum
-r--r--r-- 1 root root 4096 Oct 11 11:03 version
You'll notice a sub-directory (1-3.4.4:1.0) for each USB interface the adapter implements (in my case just one; I have another 4-port USB serial adapter that has four such sub-directories). If you look in there, you can eventually find the device node:
$ ls -l '/sys//devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/'
total 0
-rw-r--r-- 1 root root 4096 Oct 11 11:03 authorized
-r--r--r-- 1 root root 4096 Oct 11 11:03 bAlternateSetting
-r--r--r-- 1 root root 4096 Oct 11 11:03 bInterfaceClass
-r--r--r-- 1 root root 4096 Oct 11 11:03 bInterfaceNumber
-r--r--r-- 1 root root 4096 Oct 11 11:03 bInterfaceProtocol
-r--r--r-- 1 root root 4096 Oct 11 11:03 bInterfaceSubClass
-r--r--r-- 1 root root 4096 Oct 11 11:03 bNumEndpoints
lrwxrwxrwx 1 root root 0 Oct 11 11:03 driver -> ../../../../../../../../bus/usb/drivers/pl2303
drwxr-xr-x 3 root root 0 Oct 11 11:03 ep_02
drwxr-xr-x 3 root root 0 Oct 11 11:03 ep_81
drwxr-xr-x 3 root root 0 Oct 11 11:03 ep_83
-r--r--r-- 1 root root 4096 Oct 11 11:03 modalias
drwxr-xr-x 2 root root 0 Oct 11 11:03 power
lrwxrwxrwx 1 root root 0 Oct 11 11:03 subsystem -> ../../../../../../../../bus/usb
-r--r--r-- 1 root root 4096 Oct 11 11:03 supports_autosuspend
drwxr-xr-x 4 root root 0 Oct 11 11:03 ttyUSB0
-rw-r--r-- 1 root root 4096 Oct 11 11:03 uevent
$ ls -l '/sys//devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/ttyUSB0'
total 0
lrwxrwxrwx 1 root root 0 Oct 11 11:03 driver -> ../../../../../../../../../bus/usb-serial/drivers/pl2303
-r--r--r-- 1 root root 4096 Oct 11 11:03 port_number
drwxr-xr-x 2 root root 0 Oct 11 11:03 power
lrwxrwxrwx 1 root root 0 Oct 11 11:03 subsystem -> ../../../../../../../../../bus/usb-serial
drwxr-xr-x 3 root root 0 Oct 11 11:03 tty
-rw-r--r-- 1 root root 4096 Oct 11 11:03 uevent
$ ls -l '/sys//devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/ttyUSB0/tty'
total 0
drwxr-xr-x 3 root root 0 Oct 11 11:11 ttyUSB0
$ ls -l '/sys//devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/ttyUSB0/tty/ttyUSB0'
total 0
-r--r--r-- 1 root root 4096 Oct 11 11:11 dev
lrwxrwxrwx 1 root root 0 Oct 11 11:11 device -> ../../../ttyUSB0
drwxr-xr-x 2 root root 0 Oct 11 11:11 power
lrwxrwxrwx 1 root root 0 Oct 11 11:11 subsystem -> ../../../../../../../../../../../class/tty
-rw-r--r-- 1 root root 4096 Oct 11 11:11 uevent
$ cat '/sys//devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/ttyUSB0/tty/ttyUSB0/dev'
188:0
That last line is the device node major/minor number, which you can search for in /dev, or use to create your own device node.
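Searching /dev for that major/minor pair can be sketched in Python as follows. This is only an illustration: parse_dev_numbers and find_dev_nodes are hypothetical helper names, the search looks only at entries directly under /dev, and it assumes a Unix system where os.major and os.minor are available.

```python
import os
import stat


def parse_dev_numbers(text):
    """Parse the contents of a sysfs 'dev' file such as '188:0'
    into a (major, minor) tuple of integers."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def find_dev_nodes(major, minor, dev_dir="/dev"):
    """Return paths directly under dev_dir that are character devices
    with the given major/minor numbers."""
    matches = []
    for name in sorted(os.listdir(dev_dir)):
        path = os.path.join(dev_dir, name)
        try:
            st = os.stat(path)  # follows symlinks, so udev symlinks match too
        except OSError:
            continue  # dangling symlink or entry that vanished
        if not stat.S_ISCHR(st.st_mode):
            continue
        if (os.major(st.st_rdev), os.minor(st.st_rdev)) == (major, minor):
            matches.append(path)
    return matches
```

With the value shown above, find_dev_nodes(*parse_dev_numbers("188:0")) would be expected to include /dev/ttyUSB0 on that machine.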
The directory /sys/class/tty/ lists all ttys, and /sys/bus/usb-serial/devices/ lists all USB serial adapters. It might be easier to start from those directories (rather than udevadm) to find all relevant devices, then determine which one of those matches the USB device you care about.
$ ls -l /sys/bus/usb-serial/devices/
total 0
lrwxrwxrwx 1 root root 0 Oct 11 11:16 ttyUSB0 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/ttyUSB0/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB1 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.1/ttyUSB1/
lrwxrwxrwx 1 root root 0 Oct 11 11:16 ttyUSB10 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.4/1-8.4:1.3/ttyUSB10/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB12 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.6/1-8.6:1.0/ttyUSB12/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB13 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.6/1-8.6:1.1/ttyUSB13/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB14 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.6/1-8.6:1.2/ttyUSB14/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB15 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.6/1-8.6:1.3/ttyUSB15/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB2 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.2/ttyUSB2/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB3 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.3/ttyUSB3/
lrwxrwxrwx 1 root root 0 Oct 11 11:16 ttyUSB4 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.4/1-8.4:1.0/ttyUSB4/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB5 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.1/ttyUSB5/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB6 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.2/ttyUSB6/
lrwxrwxrwx 1 root root 0 Sep 6 13:48 ttyUSB7 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.3/ttyUSB7/
lrwxrwxrwx 1 root root 0 Oct 11 11:16 ttyUSB8 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.4/1-8.4:1.1/ttyUSB8/
lrwxrwxrwx 1 root root 0 Oct 11 11:16 ttyUSB9 -> ../../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8.4/1-8.4:1.2/ttyUSB9/
You can take those path names (the link values), go up two levels (../..) to reach the USB device, and read its bus/dev numbers there. I /think/ this is safe, but it depends on whether different USB devices/drivers always lay out their sysfs identically, which I believe they do for a given device type.
$ cat '/sys/devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/ttyUSB0/../../busnum'
1
$ cat '/sys/devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3.4/1-3.4.4/1-3.4.4:1.0/ttyUSB0/../../devnum'
77
In general, sysfs contains all the information you want; you just have to work out how to traverse it.
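The traversal above can be sketched in Python. This is a minimal sketch under the same assumption as the shell commands (the USB device directory sits two levels above each usb-serial port directory); usb_serial_to_bus_paths is a hypothetical name, and the sysfs_root parameter exists only so the function can be pointed at a different tree.

```python
import os


def usb_serial_to_bus_paths(sysfs_root="/sys"):
    """Map each USB serial port name (e.g. 'ttyUSB0') to the matching
    /dev/bus/usb/<bus>/<device> path, using busnum/devnum from sysfs."""
    serial_dir = os.path.join(sysfs_root, "bus", "usb-serial", "devices")
    mapping = {}
    if not os.path.isdir(serial_dir):
        return mapping  # no usb-serial devices known, or not Linux
    for name in sorted(os.listdir(serial_dir)):
        # Each entry is a symlink into the devices tree; resolve it.
        port_dir = os.path.realpath(os.path.join(serial_dir, name))
        # Assumed layout: port dir -> USB interface dir -> USB device dir.
        usb_dev_dir = os.path.dirname(os.path.dirname(port_dir))
        try:
            with open(os.path.join(usb_dev_dir, "busnum")) as f:
                busnum = int(f.read())
            with open(os.path.join(usb_dev_dir, "devnum")) as f:
                devnum = int(f.read())
        except (OSError, ValueError):
            continue  # layout differs from the assumption; skip this port
        mapping[name] = "/dev/bus/usb/%03d/%03d" % (busnum, devnum)
    return mapping
```

On the machine from the listings above, the entry for ttyUSB0 would be expected to read /dev/bus/usb/001/077.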
Re: your comment on your question of "I'm just trying to find the link between {'device': '/dev/bus/usb/005/002', 'tag': 'Future Technology Devices International, Ltd FT232 Serial (UART) IC', 'id': '0403:6001'} and ttyUSB0. so that I can always check what type of device is connected to any serial port before I begin the serial communication.":
Be careful not to read too much into the device name or USB device/vendor ID or name. Here's why.
Device node name/path:
- There are different types (protocols) of USB serial adapters. For example, I have both /dev/ttyUSB0 and /dev/ttyACM0 on my system for different types of adapter.
- The user can use udev rules to rename or symlink to the device node. For example, on my system, I have a rule that created /dev/console-nano which is a symlink to some ttyUSB node such as /dev/ttyUSB0 (the exact value depends on the order the devices are enumerated).
Thus, you should only check (via sysfs) whether the device node is a tty and perhaps also whether it's backed by the USB subsystem, or similar subsystems such as ISA/PCI/PCIe serial ports.
USB vendor/device ID:
There are many many vendors of USB serial adapters, even just at the chip level. For example, FTDI (many devices e.g. FT232R, FT4232H, ...), CH430, PL2303, Linux USB device port (common on embedded hardware) acting as a CDC ACM serial device, etc.
Thus, you should detect devices based on their capability (being a tty or serial port), rather than based on their specific USB device/vendor ID.
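A capability-based check along those lines might look like the sketch below. It assumes the sysfs layout seen in the listings earlier (a device link under /sys/class/tty/<name> and a subsystem link in each hardware directory); tty_backing_subsystems and is_usb_backed_tty are hypothetical names, not an existing API.

```python
import os


def tty_backing_subsystems(tty_name, sysfs_root="/sys"):
    """Return the subsystem names of the hardware behind a tty, nearest
    first. An empty list means sysfs does not know tty_name as a tty
    with a backing device."""
    device_link = os.path.join(sysfs_root, "class", "tty", tty_name, "device")
    if not os.path.exists(device_link):
        return []
    root = os.path.realpath(sysfs_root)
    path = os.path.realpath(device_link)
    subsystems = []
    # Walk up towards the sysfs root, recording each ancestor's subsystem.
    while path.startswith(root + os.sep):
        link = os.path.join(path, "subsystem")
        if os.path.islink(link):
            subsystems.append(os.path.basename(os.path.realpath(link)))
        path = os.path.dirname(path)
    return subsystems


def is_usb_backed_tty(tty_name, sysfs_root="/sys"):
    """True if the tty exists in sysfs and has a USB ancestor."""
    return "usb" in tty_backing_subsystems(tty_name, sysfs_root)
```

For the PL2303 adapter above, the list would be expected to start with usb-serial followed by usb; what appears for other kinds of serial port depends on the hardware.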
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
Above is an example of the content in the text file. I want to extract a string with re.
How should I construct the findall condition to achieve the expected result below? I have tried the following:
match=re.findall(r'[Tue\w]+2018$',data2)
but it is not working. I understand that $ is the symbol for the end of the string. How can I do it?
Expected Result is:
Tue Aug 21 17:02:26 2018
Tue Aug 21 17:31:06 2018
Tue Aug 21 18:10:42 2018
.
.
.
Use the pattern:
^Tue.*?2018
^ Assert position beginning of line.
Tue Literal substring.
.*? Match anything lazily.
2018 Match literal substring.
Since you are working with a multiline string and want to match the pattern at the beginning of each line, you have to use the re.MULTILINE flag.
import re
mystr="""
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
"""
print(re.findall(r'^Tue.*?2018',mystr,re.MULTILINE))
Prints:
['Tue Aug 21 17:02:26 2018', 'Tue Aug 21 17:31:06 2018', 'Tue Aug 21 18:10:42 2018']
I'm trying to pull the data contained within FTP LIST.
I'm using regex within Python 2.7.
test = "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
I've tried various incarnations of the following
from re import compile
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
'(?P<links>[0-9]{1,8})[\s]{1,20}'
'(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<size>[0-9]{1,16})[\s]{1,20}'
'(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
'(?P<date>[0-9]{1,2})[\s]{1,20}'
'(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
'(?P<filename>[\s\w\.\-]+)(?=[drwx\-]{10})')
with the last line as
'(?P<filename>.+)(?=[drwx\-]{10})')
'(?P<filename>.+(?=[drwx\-]{10}))')
and originally,
'(?P<filename>[\s\w\.\-]+(?=[drwx\-]{10}|$))')
so i can capture the last entry
but regardless, I keep getting the following output
ftp_list_re.findall(test)
[('-rw-r--r--',
'1',
'owner',
'group',
'75148624',
'Jan',
'6',
'2015',
'somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv')]
What am I doing wrong?
You should make the sub-pattern before the lookahead non-greedy. Your regex can also be shortened a bit, like this:
(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>\d{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>\d{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)
Or using compile:
from re import compile
ftp_list_re = compile(r'(?P<permissions>[d-][rwx-]{9})\s{1,20}'
                      r'(?P<links>\d{1,8})\s{1,20}'
                      r'(?P<owner>[\w-]{1,16})\s{1,20}'
                      r'(?P<group>[\w-]{1,16})\s{1,20}'
                      r'(?P<size>\d{1,16})\s{1,20}'
                      r'(?P<month>[A-Za-z]{0,3})\s{1,20}'
                      r'(?P<date>\d{1,2})\s{1,20}'
                      r'(?P<timeyear>[\d:]{4,5})\s{1,20}'
                      r'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
Code:
import re
p = re.compile(r'(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>[0-9]{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>[0-9]{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
test_str = u"-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
re.findall(p, test_str)
Regular expression quantifiers are by default "greedy" which means that they will "eat" as much as possible.
[\s\w\.\-]+
means to find at least one AND AS MANY AS POSSIBLE of whitespace, word, dot, or dash characters. The look ahead prevents it from eating the entire input (actually the regex engine will eat the entire input and then start backing off as needed), which means that it eats each file specification line, except for the last (which the look ahead insists must be left).
Adding a ? after a quantifier (*?, +?, ??, and so on) makes the quantifier "lazy" or "reluctant". This changes the meaning of "+" from "match at least one and as many as possible" to "match at least one and no more than necessary".
Therefore changing that last + to a +? should fix your problem.
The problem wasn't with the look ahead, which works just fine, but with the last subexpression before it.
EDIT:
Even with this change, your regular expression will not parse the last file specification line. This is because the regular expression insists that there must be a permission spec after the filename. To fix this, we must allow the lookahead to not match at the last specification (while still requiring it to match everywhere else). Making the following change will fix that:
ftp_list_re = compile(r'(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
                      r'(?P<links>[0-9]{1,8})[\s]{1,20}'
                      r'(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      r'(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      r'(?P<size>[0-9]{1,16})[\s]{1,20}'
                      r'(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
                      r'(?P<date>[0-9]{1,2})[\s]{1,20}'
                      r'(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
                      r'(?P<filename>[\s\w\.\-]+?)(?=(?:(?:[drwx\-]{10})|$))')
What I have done here (besides making that last + lazy) is to make the lookahead check two possibilities - either a permission specification OR an end of string. The ?: are to prevent those parentheses from capturing (otherwise you will end up with undesired extra data in your matches).
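The difference between the three variants can be seen on a shortened listing. This is a small sketch, not the full FTP pattern: the sample string and the filenames helper are made up for illustration, and only the permissions and filename fields are kept.

```python
import re

# Three entries run together without newlines, as in the question.
listing = "-rw-r--r-- a.csv-rw-r--r-- b.csv-rw-r--r-- c.csv"

# Simplified prefix: just the permission string and the whitespace after it.
PERMISSIONS = r"[d-][rwx-]{9}\s+"


def filenames(name_pattern):
    """Return the 'name' group of every match of PERMISSIONS + name_pattern."""
    pattern = re.compile(PERMISSIONS + name_pattern)
    return [m.group("name") for m in pattern.finditer(listing)]


# Greedy: eats up to the LAST place where a permission string follows.
greedy = filenames(r"(?P<name>[\s\w.-]+)(?=[drwx-]{10})")
# Lazy: stops at the FIRST such place, but still needs a following entry.
lazy = filenames(r"(?P<name>[\s\w.-]+?)(?=[drwx-]{10})")
# Lazy with "|$": the end of the string is accepted as well.
lazy_or_end = filenames(r"(?P<name>[\s\w.-]+?)(?=[drwx-]{10}|$)")

print(greedy)       # ['a.csv-rw-r--r-- b.csv']
print(lazy)         # ['a.csv', 'b.csv']
print(lazy_or_end)  # ['a.csv', 'b.csv', 'c.csv']
```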
Fixed your last line; the filename group was not working. See the fixed regex below:
(?P<permissions>[d-][rwx-]{9})[\s]{1,20}
(?P<links>[0-9]{1,8})[\s]{1,20}
(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<size>[0-9]{1,16})[\s]{1,20}
(?P<month>[A-Za-z]{0,3})[\s]{1,20}
(?P<date>[0-9]{1,2})[\s]{1,20}
(?P<timeyear>[0-9:]{4,5})[\s]{1,20}
(?P<filename>[\w\-]+\.\w+)
With the PyPI regex module, which can split on an empty match, you can do the same more simply, without having to describe all the fields:
import regex
fields = ('permissions', 'links', 'owner', 'group', 'size', 'month', 'day', 'year', 'filename')
p = regex.compile(r'(?=[d-](?:[r-][w-][x-]){3})', regex.V1)
# maxsplit=8 gives at most nine pieces, so a filename containing spaces stays whole
res = [dict(zip(fields, x.split(None, 8))) for x in p.split(test)[1:]]
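The same idea can be sketched with the standard library alone, assuming Python 3.7 or later, where re.split also splits on empty matches. The parse_listing helper and the shortened sample string are illustrative, not part of the original answer.

```python
import re

FIELDS = ('permissions', 'links', 'owner', 'group', 'size',
          'month', 'day', 'year', 'filename')

# Zero-width lookahead that matches just before each permission string.
ENTRY_START = re.compile(r'(?=[d-](?:[r-][w-][x-]){3})')


def parse_listing(listing):
    """Split a run-together listing into one dict per entry."""
    # Splitting at position 0 yields an empty first chunk; drop it.
    entries = [chunk for chunk in ENTRY_START.split(listing) if chunk.strip()]
    # maxsplit=8 yields at most nine pieces, so spaces in the filename survive.
    return [dict(zip(FIELDS, entry.split(None, 8))) for entry in entries]


sample = ("-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv"
          "-rw-r--r-- 1 owner group 103576 Feb 27 2015 some name-corrected.csv")
parsed = parse_listing(sample)
print(parsed[1]['filename'])  # some name-corrected.csv
```

Like the regex-module version, this relies on no filename containing something that looks like a permission string.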