First, I am using Python 2.7.
What I am trying to achieve is to separate table data from an Active Directory lookup into "Firstname Lastname" items in a list, which I can later compare against a different list to see which users do not match.
I run dsquery group domainroot -name groupname | dsget group -members | dsget user -fn -ln, which outputs a list like this:
fn ln
Peter Brill
Cliff Lach
Michael Tsu
Ashraf Shah
Greg Coultas
Yi Li
Brad Black
Kevin Schulte
Raymond Masters (Admin)
James Rapp
Allison Wurst
Benjamin Hammel
Edgar Cuevas
Vlad Dorovic (Admin)
Will Wang
dsget succeeded
Notice that this list has whitespace both before and after each field.
The code I am using currently:
import re
from subprocess import Popen, PIPE

userarray = []
p = Popen(["cmd.exe"], stdin=PIPE, stdout=PIPE)
p.stdin.write("dsquery group domainroot -name groupname | dsget group -members | dsget user -fn -ln\n")
p.stdin.write("exit\n")
# skip the cmd.exe banner (266 chars) and flatten the output to one string
processStdout = p.stdout.read().replace("\r\n", "")[266:]
# collapse runs of whitespace and drop everything from "dsget succeeded" on
cutWhitespace = ' '.join(processStdout.split()).split("dsget")[0]
# regroup on capital letters
processSplit = re.findall('[A-Z][^A-Z]*', cutWhitespace)
userarray.append(processSplit)
print userarray
My problem: when I split on whitespace and attempt to re-group the tokens into "Firstname Lastname" pairs, the line containing (Admin) throws the grouping off because it has a third field. Here is a sample of what I mean:
['Brad ', 'Black ', 'Kevin ', 'Schulte ', 'Raymond ', 'Masters (', 'Admin) ', 'James ', 'Rapp ', 'Allison ', 'Wurst ',
I would appreciate any suggestions on how to group this better or correctly. Thanks!
# the whole file
content = p.stdout.read()
# each line as a single string
lines = content.splitlines()
# let's drop the header and the trailing "dsget succeeded" line
lines = lines[1:-1]
# notice how the last name starts at column 19
names = [(line[:19].strip(), line[19:].strip()) for line in lines]
print(names)
=> [('Peter', 'Brill'), ('Cliff', 'Lach'), ('Michael', 'Tsu'), ('Ashraf', 'Shah'), ('Greg', 'Coultas'), ('Yi', 'Li'), ('Brad', 'Black'), ('Kevin', 'Schulte'), ('Raymond', 'Masters (Admin)'), ('James', 'Rapp'), ('Allison', 'Wurst'), ('Benjamin', 'Hammel'), ('Edgar', 'Cuevas'), ('Vlad', 'Dorovic (Admin)'), ('Will', 'Wang')]
Now, if the column width changes, just do pos = lines[0].index('ln') before dropping the header and use that instead of 19.
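For instance, a small self-contained sketch (with a shortened, made-up column width) of deriving the split position from the header instead of hard-coding it:

```python
# Sketch: derive the column where the last name starts from the header
# row itself, so the code survives a change in column width.
lines = ["fn        ln",
         "Peter     Brill",
         "Raymond   Masters (Admin)"]

pos = lines[0].index("ln")  # column where the "ln" header begins
names = [(line[:pos].strip(), line[pos:].strip()) for line in lines[1:]]
print(names)  # [('Peter', 'Brill'), ('Raymond', 'Masters (Admin)')]
```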
split has a maxsplit argument, so you can tell it to split at most once, at the first separator:
cutWhitespace = ' '.join(processStdout.split(None, 1)).split("dsget")[0]
on your sixth line.
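To illustrate what maxsplit does (sample string made up):

```python
s = "  Raymond    Masters (Admin)"
print(s.split())         # ['Raymond', 'Masters', '(Admin)'] - splits at every run
print(s.split(None, 1))  # ['Raymond', 'Masters (Admin)'] - at most one split
```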
Related
The code below tries to extract fields from a file (format: group, team, val1, val2). It works on lines with single spaces between fields, but produces wrong results on lines that have additional spaces in between.
data = {}
with open('source.txt') as f:
    for line in f:
        print("this is the line data: ", line)
        needed = line.split()[0:2]
        print("this is what i need: ", needed)
source.txt #-- format: group, team, val1, val2
alpha diehard group 1 54,00.01
bravo nevermindteam 3 500,000.00
charlie team ultimatum 1 27,722.29 ($250.45)
charlie team ultimatum 10 252,336,733.383 ($492.06)
delta beyond-imagination 2 11 ($10)
echo double doubt 5 143,299.00 ($101)
echo double doubt 8 145,300 ($125.01)
falcon revengers 3 0.1234
falcon revengers 5 9.19
lima almost done 6 45.00181 ($38.9)
romeo ontheway home 12 980
I am trying to just extract the values before val1. #-- group, team
alpha diehard group
bravo nevermindteam
charlie team ultimatum
delta beyond-imagination
echo double doubt
falcon revengers
lima almost done
romeo ontheway home
Use a regex.
import re

with open('source.txt') as f:
    for line in f:
        found = re.search(r"(.*?)\d", line)
        needed = found.group(1).split()[0:3]
        print(needed)
Output:
['alpha', 'diehard', 'group']
['bravo', 'nevermindteam']
['charlie', 'team', 'ultimatum']
['charlie', 'team', 'ultimatum']
['delta', 'beyond-imagination']
['echo', 'double', 'doubt']
['echo', 'double', 'doubt']
['falcon', 'revengers']
['falcon', 'revengers']
['lima', 'almost', 'done']
['romeo', 'ontheway', 'home']
Here's how I did it, basically iterate through all the words and stop when you hit a numeric:
data = {}
with open('source.txt') as f:
    for line in f:
        print("this is the line data: ", line)
        split_line = line.split()
        for i in range(len(split_line)):
            if split_line[i].isnumeric():
                break
        needed = split_line[0:i]
        print("this is what i need: ", needed)
Try with:
with open('source.txt') as f:
    for line in f:
        new_line = ' '.join(filter(lambda s: s.isalpha(), line.split(' ')))
        print(new_line)
The original code is sensitive to the amount of white space; filtering on isalpha() discards the empty strings that extra spaces produce (though note it also drops hyphenated names like beyond-imagination).
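The whitespace sensitivity comes from split(' '), which yields an empty string for every extra space, whereas a bare split() collapses the runs. A quick sketch on one of the sample lines:

```python
line = "echo  double  doubt  5  143,299.00"
print(line.split(' '))  # contains '' entries for the doubled spaces
print(line.split())     # ['echo', 'double', 'doubt', '5', '143,299.00']

# filtering alphabetic tokens then works regardless of spacing
words = [w for w in line.split() if w.isalpha()]
print(words)            # ['echo', 'double', 'doubt']
```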
With a regex:
import re

with open('source.txt', 'r') as f:
    text = re.sub(r'[0-9,.()$\s]+\n', '\n', f.read() + '\n', flags=re.M)
Note that re.M must be passed via the flags keyword; as a positional argument it would be interpreted as count.
This is the data
row1| sbkjd nsdnak ABC
row2| vknfe edcmmi ABC
row3| fjnfn msmsle XYZ
row4| sdkmm tuiepd XYZ
row5| adjck rulsdl LMN
I have already tried this using pandas and got help from stackoverflow. But, I want to be able to remove the duplicates without having to use the pandas library or any library in general. So, only one of the rows having "ABC" must be chosen, only one of the rows having "XYZ" must be chosen and the last row is unique, so, it should be chosen. How do I do this?
So, my final output should contain this:
[ row1 or row2 + row3 or row4 + row5 ]
This should only select the unique rows from your original table. If there are two or more rows which share duplicate data, it will select the first row.
data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

def check_list_uniqueness(candidate_row, unique_rows):
    for element in candidate_row:
        for unique_row in unique_rows:
            if element in unique_row:
                return False
    return True

final_rows = []
for row in data:
    if check_list_uniqueness(row, final_rows):
        final_rows.append(row)

print(final_rows)
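Note that the function above compares every element, so two different rows sharing any value would collide. If uniqueness is decided only by the third column (the ABC/XYZ/LMN tag), a simpler sketch is to track the tags already seen in a set and keep the first row for each:

```python
data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

seen = set()
final_rows = []
for row in data:
    if row[2] not in seen:   # first row carrying this tag wins
        seen.add(row[2])
        final_rows.append(row)

print(final_rows)  # one row per tag: ABC, XYZ, LMN
```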
This Bash command would do (assuming your data is in a file called test, and that values of column 4 do not appear in other columns)
cut -d ' ' -f 4 test | tr '\n' ' ' | sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' | tr ' ' '\n' | while read str; do grep -m 1 $str test; done
cut -d ' ' -f 4 test chooses the data in the fourth column
tr '\n' ' ' turns the column into a row (translating new line character to a space)
sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' deletes the repetitions
tr ' ' '\n' turns the row of unique values to a column
while read str; do grep -m 1 $str test; done reads the unique words and prints the first line from test that matches that word
I have a text file which stores data like name : score e.g.:
bob : 10
fred : 3
george : 5
However, I want to make it so it says
10 : bob
3 : fred
5 : george
What would the code be to flip it like that?
Would I need to separate them first by removing the colon? I have managed that with this code:
file = open("Class 3.txt", "r")
t4 = file.read()
test = ''.join(t4.split(':')[0:10])
print(test)
How would I finish it and make it say the reverse?
This code handles fractional scores (e.g. 9.5), and doesn't care whether there are extra spaces around the : delimiter. It should be much easier to maintain than your current code.
Class 3.txt:
bob : 10
fred : 3
george : 5
Code:
class_num = input('Which class (1, 2, or 3)? ')
score_sort = input('Sort by name or score? ').lower().startswith('s')

with open("Class " + class_num + ".txt", "r") as f:
    scores = {name.strip(): float(score) for
              name, score in (line.strip().split(':') for line in f)}

if score_sort:
    for name in sorted(scores, key=scores.get, reverse=True):
        print(scores.get(name), ':', name)
else:
    for name in sorted(scores):
        print(name, ':', scores.get(name))
Input:
3
scores
Output:
10.0 : bob
5.0 : george
3.0 : fred
Input:
3
name
Output:
bob : 10.0
fred : 3.0
george : 5.0
First, this is going to be a lot harder to do whole-file-at-once than line-at-a-time.
But, either way, you obviously can't just split(':') and then ''.join(…). All that's going to do is replace colons with nothing. You obviously need ':'.join(…) to put the colons back in.
And meanwhile, you have to swap the values around on each side of each colon.
So, here's a function that takes just one line, and swaps the sides:
def swap_sides(line):
    left, right = line.split(':')
    return ':'.join((right, left))
But you'll notice there's a few problems here. The left has a space before the colon; the right has a space after the colon, and a newline at the end. How are you going to deal with that?
The simplest way is to just strip out all the whitespace on both sides, then add back in the whitespace you want:
def swap_sides(line):
    left, right = line.split(':')
    return ':'.join((right.strip() + ' ', ' ' + left.strip())) + '\n'
But a smarter idea is to treat the space around the colon as part of the delimiter. (The newline, you'll still need to handle manually.)
def swap_sides(line):
    left, right = line.strip().split(' : ')
    return ' : '.join((right.strip(), left.strip())) + '\n'
But if you think about it, do you really need to add the newline back on? If you're just going to pass it to print, the answer is obviously no. So:
def swap_sides(line):
    left, right = line.strip().split(' : ')
    return ' : '.join((right.strip(), left.strip()))
Anyway, once you're happy with this function, you just write a loop that calls it once for each line. For example:
with open("Class 3.txt", "r") as file:
    for line in file:
        swapped_line = swap_sides(line)
        print(swapped_line)
Let's learn how to reverse a single line:
line = 'bob : 10'
line.partition(' : ')                     # ('bob', ' : ', '10')
''.join(reversed(line.partition(' : ')))  # '10 : bob'
Now, combine with reading lines from a file:
for line in open('Class 3.txt').read().splitlines():
    print ''.join(reversed(line.partition(' : ')))
Update
I am re-writing the code to read the file, line by line:
with open('Class 3.txt') as input_file:
    for line in input_file:
        line = line.strip()
        print ''.join(reversed(line.partition(' : ')))
I am using findall to separate text.
I started with the expression re.findall(r'(.*?)(\$.*?\$)', allData), but it doesn't give me the data after the last piece of text found. I missed the '\n6\n'.
How do I get the last piece of text?
Here is my python code:
#!/usr/bin/env python
import re
allData = '''
1
2
3 here Some text in here
$file1.txt$
4 Some text in here and more $file2.txt$
5 Some text $file3.txt$ here
$file3.txt$
6
'''
for record in re.findall(r'(.*?)(\$.*?\$)|(.*?$)', allData, flags=re.DOTALL):
    print repr(record)
The output I get for this is:
('\n1\n2\n3 here Some text in here \n', '$file1.txt$', '')
('\n4 Some text in here and more ', '$file2.txt$', '')
('\n5 Some text ', '$file3.txt$', '')
(' here \n', '$file3.txt$', '')
('', '', '\n6\n')
('', '', '')
('', '', '')
I really would like this output:
('\n1\n2\n3 here Some text in here \n', '$file1.txt$')
('\n4 Some text in here and more ', '$file2.txt$')
('\n5 Some text ', '$file3.txt$')
(' here \n', '$file3.txt$')
('\n6\n', '')
Background info in case you need to see the larger picture.
In case you are interested, I'm re-writing this in Python. I have the rest of the code under control; I am just getting too much stuff out of findall.
https://discussions.apple.com/message/21202021#21202021
If I understand correctly from that Apple link you want to do something like:
import re
allData = '''
1
2
3 here Some text in here
$file1.txt$
4 Some text in here and more $file2.txt$
5 Some text $file3.txt$ here
$file3.txt$
6
'''
def read_file(m):
    return open(m.group(1)).read()

# Sloppy matching :D
# print re.sub(r"\$(.*?)\$", read_file, allData)
# More precise:
print re.sub(r"\$(file\d+?\.txt)\$", read_file, allData)
EDIT: As Oscar suggests, make the match more precise, i.e. take the filename between the $ signs and read that file for the data; that's what the above does.
Example output:
1
2
3 here Some text in here
I'am file1.txt
4 Some text in here and more
I'am file2.txt
5 Some text
I'am file3.txt
here
I'am file3.txt
6
Files:
==> file1.txt <==
I'am file1.txt
==> file2.txt <==
I'am file2.txt
==> file3.txt <==
I'am file3.txt
To achieve the output you want you need to restrict your pattern to 2 capture groups. (If you use 3 capture groups, you will have 3 elements in every "record").
You could make the second group optional, that should do the job:
r'([^$]*)(\$.*?\$)?'
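A sketch of that pattern on a shortened stand-in for the question's data (note that findall also emits a final all-empty match, which is easy to filter out):

```python
import re

# shortened stand-in for the question's allData
allData = "\n1\n2\n3 here text\n$file1.txt$\n4 more $file2.txt$\n6\n"

records = re.findall(r'([^$]*)(\$.*?\$)?', allData)
records = [r for r in records if any(r)]  # drop the trailing ('', '')
print(records)
# [('\n1\n2\n3 here text\n', '$file1.txt$'), ('\n4 more ', '$file2.txt$'), ('\n6\n', '')]
```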
Here's one way to solve your substitution problem with findall.
def readfile(name):
    with open(name) as f:
        return f.read()

r = re.compile(r"\$(.+?)\$|(\$|[^$]+)")
print "".join(readfile(filename) if filename else text
              for filename, text in r.findall(allData))
This one partly solves your problem:
import re
allData = '''
1
2
3 here Some text in here
$file1.txt$
4 Some text in here and more $file2.txt$
5 Some text $file3.txt$ here
$file3.txt$
6
'''
for record in re.findall(r'(.*?)(\$.*?\$)|(.*?$)', allData.strip(), flags=re.DOTALL):
    print [x for x in record if x]
producing output
['1\n2\n3 here Some text in here \n', '$file1.txt$']
['\n4 Some text in here and more ', '$file2.txt$']
['\n5 Some text ', '$file3.txt$']
[' here \n', '$file3.txt$']
['\n6']
[]
Avoid the last empty list with:
for record in re.findall(r'(.*?)(\$.*?\$)|(.*?$)', allData.strip(), flags=re.DOTALL):
    if [x for x in record if x]:
        print [x for x in record if x]
I have text file as follows:
HEADER INFO
Last1, First1    Movie1 (1991) random stuff
                 Movie2 (1992) random stuff
                 Movie3 (1995) random stuff
                 Movie4 (3455) random stuff
Last2, First2    Movie1 (1998) random stuff
                 Movie2 (4568) random stuff
                 Movie3 (2466) random stuff
                 Movie4 (4325) random stuff
                 Movie5 (4875) random stuff
                 Movie6 (3525) random stuff
                 Movie7 (4567) random stuff
FOOTER INFO
It also contains some header/footer info that I can skip. The spaces between the name and movie are not constant. I want to add this data to a dictionary using while loops (no for loops for the whole process). Basically, the name will act as the key and the list of movies that follow it will be the value (both strings). So far I can obtain either the lines which contain the names OR the lines which contain the movies; I tried using an if statement to combine them, but to no avail.
Basically I was thinking: if the line contains a name (detected by some characteristic of the line), splice out the name, splice out the movie, and add them to the dictionary; if the line does not contain a name, associate that movie with the previous name (multiple entries). I think this is where I'm lost: this part, and maybe how I'm iterating with the while loop.
I didn't use any readline(). Instead I used readlines() and I used that to toggle through the lines to pick out the information. I'm just wondering if anyone has any tips/hints they could offer.
If anyone wants the actual data I'm using then please pm me.
I'll rephrase it:
CRC: 0xDE308B96 File: actors.list Date: Fri Aug 12 00:00:00 2011
Copyright 1990-2007 The Internet Movie Database, Inc. All rights reserved.
COPYING POLICY: Internet Movie Database (IMDb)
==============================================
CUTTING COPYRIGHT NOTICE
THE ACTORS LIST
===============
Name            Titles
----            ------
ActA, A         m1 (2011)
                m2 (2011)
ActB, B         m1 (2011)
                m2 (2011)
                m3 (2001)
ActC, C         m1 (2011)
ActD, D         m3 (2003)
                m6 (2006)
ActE, E         m6 (2006)
ActF, F         m4 (2004)
ActG, G         m4 (2004)
ActH, H         m5 (2005)
Bacon, Kevin    m2 (2011)
                m5 (2005)
-----------------------------------------------------------------------------
SUBMITTING UPDATES
==================
CUTTING UPDATES
For further info visit http://www.imdb.com/licensing/contact
And basically I want the output to be a dictionary:
{'E Acte': ['m6 (2006)'],
'A Acta': ['m1 (2011)', 'm2 (2011)'],
'G Actg': ['m4 (2004)'],
'B Actb': ['m1 (2011)', 'm2 (2011)', 'm3 (2001)'],
'D Actd': ['m3 (2003)', 'm6 (2006)'],
'F Actf': ['m4 (2004)'],
'Kevin Bacon': ['m2 (2011)', 'm5 (2005)'],
'H Acth': ['m5 (2005)'],
'C Actc': ['m1 (2011)']}
It was suggested that I use while loops since that will make the process easier, but I'm not restricted solely to them.
Here is another solution for the case when the list is formatted with tab chars instead of spaces:
output = {}
in_list = False
current_name = None

for line in open('actors.list'):
    if in_list:
        if line.startswith('-'):
            break
        if '\t' not in line:
            continue
        name, title = line.split('\t', 1)
        name = name.strip()
        title = title.strip()
        if name:
            if ',' in name:
                name = name.split(',', 1)
                name[0] = name[0].rstrip()
                name[1] = name[1].lstrip()
                name.reverse()
                name = ' '.join(name)
            current_name = name
        if title:
            output.setdefault(current_name, []).append(title)
    else:
        if line.startswith('-'):
            in_list = True
Here is a solution with a for loop which is much more natural in Python. It assumes the input file is formatted with spaces, like the code posted in the question above. I have posted an alternative answer now for the case when the list is formatted with tabs instead of spaces.
Of course you could rewrite it as a while loop, but it would not make much sense. You can also simplify it a bit by using a defaultdict(list) for the output in newer Python versions.
output = {}
pos = -1  # char position of title column
current_name = None

for line in open('actors.list'):
    if pos < 0:
        if line.startswith('-'):
            pos = line.find(' ')
            if pos > 0:
                pos = line.find('-', pos)
    else:
        if line.startswith('-'):
            break
        name = line[:pos].strip()
        title = line[pos:].strip()
        if name:
            if ',' in name:
                name = name.split(',', 1)
                name[0] = name[0].rstrip()
                name[1] = name[1].lstrip()
                name.reverse()
                name = ' '.join(name)
            current_name = name
        if title:
            output.setdefault(current_name, []).append(title)

print output
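The setdefault call in both loops maps directly onto the collections.defaultdict mentioned above; a minimal sketch with made-up (name, title) pairs standing in for the parsed lines:

```python
from collections import defaultdict

output = defaultdict(list)  # missing keys start out as empty lists

# pairs as they would come out of the parsing loop (made up here)
pairs = [('A ActA', 'm1 (2011)'),
         ('A ActA', 'm2 (2011)'),
         ('B ActB', 'm1 (2011)')]

for name, title in pairs:
    output[name].append(title)  # no setdefault needed

print(dict(output))
# {'A ActA': ['m1 (2011)', 'm2 (2011)'], 'B ActB': ['m1 (2011)']}
```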