I'm attempting to code a script that outputs each user and their group on their own line like so:
user1 group1
user2 group1
user3 group2
...
user10 group6
etc.
I'm writing up a script in python for this but was wondering how SO might do this.
p.s. Take a whack at it in any language but I'd prefer python.
EDIT: I'm working on Linux. Ubuntu 8.10 or CentOS =)
For *nix, you have the pwd and grp modules. Iterate through pwd.getpwall() to get all users, then look up each user's primary group name with grp.getgrgid(gid).
import pwd, grp
for p in pwd.getpwall():
    # p[0] is pw_name, p[3] is pw_gid (the primary group id)
    print p[0], grp.getgrgid(p[3])[0]
The grp module is your friend: grp.getgrall() gives you a list of all groups and their members.
EDIT example:
import grp
groups = grp.getgrall()
for group in groups:
    # group[3] is gr_mem (the member names), group[0] is gr_name
    for user in group[3]:
        print user, group[0]
sh/bash:
getent passwd | cut -f1 -d: | while read name; do echo -n "$name " ; groups $name ; done
The Python call grp.getgrall() only shows local groups, unlike the getgrouplist C function, which also covers users in sssd backed by LDAP with enumeration turned off (as in FreeIPA).
After searching for the easiest way to get all the groups a user belongs to in Python, the best way I found was to call the getgrouplist C function directly:
#!/usr/bin/python
import grp, pwd, os
from ctypes import *
from ctypes.util import find_library
libc = cdll.LoadLibrary(find_library('c'))  # find_library() takes the name without the 'lib' prefix
getgrouplist = libc.getgrouplist
# 50 groups should be enough?
ngroups = 50
getgrouplist.argtypes = [c_char_p, c_uint, POINTER(c_uint * ngroups), POINTER(c_int)]
getgrouplist.restype = c_int32
grouplist = (c_uint * ngroups)()
ngrouplist = c_int(ngroups)
user = pwd.getpwuid(2540485)
ct = getgrouplist(user.pw_name, user.pw_gid, byref(grouplist), byref(ngrouplist))
# if 50 groups was not enough this will be -1, try again
# luckily the last call put the correct number of groups in ngrouplist
if ct < 0:
    getgrouplist.argtypes = [c_char_p, c_uint, POINTER(c_uint * int(ngrouplist.value)), POINTER(c_int)]
    grouplist = (c_uint * int(ngrouplist.value))()
    ct = getgrouplist(user.pw_name, user.pw_gid, byref(grouplist), byref(ngrouplist))

for i in xrange(0, ct):
    gid = grouplist[i]
    print grp.getgrgid(gid).gr_name
Getting a list of all users to run this function on would similarly require figuring out which C call getent passwd makes and calling that from Python.
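A pragmatic alternative (a sketch, not from the original answer): shell out to getent itself for the user list, and on Python 3.3+ use os.getgrouplist() from the standard library, which wraps the same C call without the ctypes setup. Whether directory users actually appear still depends on the backend's enumeration settings:

import grp
import os
import subprocess

# getent goes through NSS, so it sees whatever sources (files, sssd, LDAP)
# are configured, subject to the backend's enumeration settings.
passwd = subprocess.check_output(['getent', 'passwd']).decode()

for line in passwd.splitlines():
    fields = line.split(':')
    name, gid = fields[0], int(fields[3])
    # os.getgrouplist() (Python 3.3+) wraps the same getgrouplist() C call
    for g in os.getgrouplist(name, gid):
        print(name, grp.getgrgid(g).gr_name)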
Here is a simple function capable of dealing with the structure of either of these files (/etc/passwd and /etc/group).
I believe this code meets your needs, using only Python built-ins and no additional modules:
#!/usr/bin/python
def read_and_parse(filename):
    """
    Reads and parses lines from /etc/passwd and /etc/group.

    Parameters
    filename : str
        Full path for filename.
    """
    data = []
    with open(filename, "r") as f:
        for line in f.readlines():
            data.append(line.split(":")[0])
    data.sort()
    for item in data:
        print("- " + item)

read_and_parse("/etc/group")
read_and_parse("/etc/passwd")
Related
I'm looking for ways to make the code more efficient (runtime and memory complexity)
Should I use something like a Max-Heap?
Is the bad performance due to the string concatenation, sorting the dictionary not in-place, or something else?
Edit: I replaced the dictionary/map object with a Counter applied to a list of all retrieved names (with duplicates).
minimal request: the script should take less than 30 seconds
current runtime: it takes 54 seconds
# Try to implement the program efficiently (running the script should take less than 30 seconds)
import requests
# Requests is an elegant and simple HTTP library for Python, built for human beings.
# Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
# Requests is not a built in module (does not come with the default python installation), so you will have to install it:
# http://docs.python-requests.org/en/v2.9.1/
# installing it for pyCharm is not so easy and takes a lot of troubleshooting (problems with pip's main version)
# use conda/pip install requests instead
import json
# dict subclass for counting hashable objects
from collections import Counter
#import heapq
import datetime
url = 'https://api.namefake.com'
# a "global" list object. TODO: try to make it "static" (local to the file)
words = []
#####################################################################################
# Calls the site http://www.namefake.com 100 times and retrieves random names
# Examples for the format of the names from this site:
# Dr. Willis Lang IV
# Lily Purdy Jr.
# Dameon Bogisich
# Ms. Zora Padberg V
# Luther Krajcik Sr.
# Prof. Helmer Schaden etc....
#####################################################################################
requests.packages.urllib3.disable_warnings()
t = datetime.datetime.now()
for x in range(100):
    # for each name, break it to first and last name
    # no need for authentication
    # http://docs.python-requests.org/en/v2.3.0/user/quickstart/#make-a-request
    responseObj = requests.get(url, verify=False)
    # Decoding JSON data from returned response object text
    # Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    # containing a JSON document) to a Python object.
    jsonData = json.loads(responseObj.text)
    x = jsonData['name']
    newName = ""
    for full_name in x:
        # make a string from the decoded python object concatenation
        newName += str(full_name)
    # split by whitespaces
    y = newName.split()
    # parse the first name (check first if a header exists: Prof., Dr., Mr., Miss)
    if "." in y[0] or "Miss" in y[0]:
        words.append(y[2])
    else:
        words.append(y[0])
        words.append(y[1])
# Return the top 10 words that appear most frequently, together with the number of times, each word appeared.
# Output example: ['Weber', 'Kris', 'Wyman', 'Rice', 'Quigley', 'Goodwin', 'Lebsack', 'Feeney', 'West', 'Marlen']
# (We don't care whether the word was a first or a last name)
# list of tuples
top_ten = Counter(words).most_common(10)
top_names_list = [name[0] for name in top_ten]
print((datetime.datetime.now()-t).total_seconds())
print(top_names_list)
You are calling an endpoint of an API that generates dummy information one person at a time - that takes a considerable amount of time.
The rest of the code is taking almost no time.
Change the endpoint you are using (there is no bulk-name-gathering on the one you use) or use built-in dummy data provided by python modules.
You can clearly see that "counting and processing names" is not the bottleneck here:
from faker import Faker # python module that generates dummy data
from collections import Counter
import datetime
fake = Faker()
c = Counter()
# get 10,000 names, split them and add the 1st part
t = datetime.datetime.now()
c.update( (fake.name().split()[0] for _ in range(10000)) )
print(c.most_common(10))
print((datetime.datetime.now()-t).total_seconds())
Output for 10000 names:
[('Michael', 222), ('David', 160), ('James', 140), ('Jennifer', 134),
('Christopher', 125), ('Robert', 124), ('John', 120), ('William', 111),
('Matthew', 111), ('Lisa', 101)]
in
1.886564 # seconds
General advice for code optimization: measure first, then optimize the bottlenecks.
If you need a code review, check https://codereview.stackexchange.com/help/on-topic to see if your code fits the requirements of the Code Review Stack Exchange site. As with SO, some effort should be put into the question first - i.e. analyzing where the majority of your time is being spent.
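For that "measure first" step, the standard library profiler is usually enough; a minimal sketch (main() here is a hypothetical wrapper around the question's download loop, not something from the original code):

import cProfile
import pstats

# Profile a hypothetical main() and show the 10 most expensive calls by
# cumulative time; the requests.get() calls will dominate the report.
cProfile.run('main()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)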
Edit - with performance measurements:
import requests
import json
from collections import defaultdict
import datetime
# defaultdict is (in this case) better than Counter because you add 1 name at a time
# Counter is superior if you update whole iterables of names at a time
d = defaultdict(int)
def insertToDict(n):
    d[n] += 1
url = 'https://api.namefake.com'
api_times = []
process_times = []
requests.packages.urllib3.disable_warnings()
for x in range(10):
    # for each name, break it to first and last name
    try:
        t = datetime.datetime.now()  # start time for API call
        # no need for authentication
        responseObj = requests.get(url, verify=False)
        jsonData = json.loads(responseObj.text)
        # end time for API call
        api_times.append((datetime.datetime.now() - t).total_seconds())
        x = jsonData['name']
        t = datetime.datetime.now()  # start time for name processing
        newName = ""
        for name_char in x:
            # make a string from the decoded python object concatenation
            newName = newName + str(name_char)
        # split by whitespaces
        y = newName.split()
        # parse the first name (check first if a header exists: Prof., Dr., Mr., Miss)
        if "." in y[0] or "Miss" in y[0]:
            insertToDict(y[2])
        else:
            insertToDict(y[0])
            insertToDict(y[1])
        # end time for name processing
        process_times.append((datetime.datetime.now() - t).total_seconds())
    except:
        continue
newA = sorted(d, key=d.get, reverse=True)[:10]
print(newA)
print(sum(api_times))
print(sum( process_times ))
Output:
['Ruecker', 'Clare', 'Darryl', 'Edgardo', 'Konopelski', 'Nettie', 'Price',
'Isobel', 'Bashirian', 'Ben']
6.533625
0.000206
You can make the parsing part better; I did not, because it does not matter.
It is better to use timeit for performance testing (it calls the code multiple times and averages, smoothing out artifacts due to caching/lag/...) (thanks @bruno desthuilliers). In this case I did not use timeit because I did not want to call the API 100000 times to average the results.
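For the local part alone, timeit is cheap; a minimal sketch (parse() is a stand-in I made up for the splitting/title-stripping logic above, and the sample name is made up too):

import timeit

def parse(full_name):
    # same idea as above: drop the title, keep the first/last name parts
    parts = full_name.split()
    if "." in parts[0] or "Miss" in parts[0]:
        return [parts[2]]
    return parts[:2]

# total seconds for 100000 parses - a tiny fraction of one API round-trip
print(timeit.timeit(lambda: parse("Dr. Willis Lang IV"), number=100000))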
Terms:
talib: Technical Analysis Library (stock market indicators, charts etc)
CDL: Candle or Candlestick
Short version: I want to run my_lib.some_function() based on the string 'some_function'
On quantopian.com I want to call all of the 60 talib functions that start with CDL, like talib.CDL2CROWS(), in a loop for brevity. First pull the function names as strings, then run the functions by the name matching the string.
Those CDL functions all take the same inputs (lists of open, high, low and closing prices for a time period), and the test here just uses a list of length 1 to simplify.
import talib, re
import numpy as np
# Make a list of talib's function names that start with 'CDL'
cdls = re.findall('(CDL\w*)', ' '.join(dir(talib)))
# cdls[:3], the first three like ['CDL2CROWS', 'CDL3BLACKCROWS', 'CDL3INSIDE']
for cdl in cdls:
    codeobj = compile(cdl + '(np.array([3]),np.array([4]),np.array([5]),np.array([6]))', 'talib', 'exec')
    exec(codeobj)
    break
# Output: NameError: name 'CDL2CROWS' is not defined
Try number two:
import talib, re
import numpy as np
cdls = re.findall('(CDL\w*)', ' '.join(dir(talib)))
for cdl in cdls:
    codeobj = compile('talib.' + cdl + '(np.array([3]),np.array([4]),np.array([5]),np.array([6]))', '', 'exec')
    exec(codeobj)
    break
# Output: AssertionError: open is not double
I didn't find that error online.
Related, where I asked the question over there: https://www.quantopian.com/posts/talib-indicators (111 views, no replies yet)
For anyone curious about candlesticks: http://thepatternsite.com/TwoCrows.html
Update
This works, after help in chat from Anzel; possibly the floats in the lists were key.
import talib, re
import numpy as np
cdls = re.findall('(CDL\w*)', ' '.join(dir(talib)))
# O, H, L, C = Open, High, Low, Close
O = [ 167.07, 170.8, 178.9, 184.48, 179.1401, 183.56, 186.7, 187.52, 189.0, 193.96 ]
H = [ 167.45, 180.47, 185.83, 185.48, 184.96, 186.3, 189.68, 191.28, 194.5, 194.23 ]
L = [ 164.2, 169.08, 178.56, 177.11, 177.65, 180.5, 185.611, 186.43, 188.0, 188.37 ]
C = [ 166.26, 177.8701, 183.4, 181.039, 182.43, 185.3, 188.61, 190.86, 193.39, 192.99 ]
for cdl in cdls:  # the string that becomes the function name
    toExec = getattr(talib, cdl)
    out = toExec(np.array(O), np.array(H), np.array(L), np.array(C))
    print str(out) + ' ' + cdl
Choices on how to add arguments to your string-turned-function:
out = getattr(talib, cdl)(args)   # look up and call in one step
or
toExec = getattr(talib, cdl)
toExec(args)
A simpler way would be using the abstract lib
import talib
import numpy as np

# All the CDL functions are under the Pattern Recognition group
for cdl in talib.get_function_groups()['Pattern Recognition']:
    # get the function object
    cdl_func = talib.abstract.Function(cdl)
    # you can use the info property to get the name of the pattern
    print('Checking', cdl_func.info['display_name'], 'pattern')
    # run the function as usual (O, H, L, C as defined in the question's update)
    cdl_func(np.array(O), np.array(H), np.array(L), np.array(C))
If you want to run my_lib.some_function() based on the string 'some_function', use getattr like this:
some_function = 'your_string'
toExec = getattr(my_lib, some_function)
# to call the function
toExec()
# an example using math
>>> some_function = 'sin'
>>> toExec = getattr(math, some_function)
>>> toExec
<function math.sin>
>>> toExec(90)
0.8939966636005579
Update for your code to run:
for cdl in cdls:
    toExec = getattr(talib, cdl)
    # turns out you need to pass ndarrays as the params
    toExec(np.array(yourlist), np.array(yourlist), np.array(yourlist), np.array(yourlist))
I also suggest you review yourlist: it's currently a plain 1-dimensional Python list, whereas these functions want NumPy ndarrays.
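To illustrate that (a sketch, not part of the original answer): the "open is not double" assertion from the question goes away once the inputs are 1-D float64 ndarrays, so an explicit conversion of the list is enough. The data below is a placeholder:

import numpy as np
import talib

yourlist = [3.0, 4.0, 5.0, 6.0]               # placeholder price data
arr = np.asarray(yourlist, dtype=np.float64)  # 1-D float64 ndarray, as talib expects
print(talib.CDL2CROWS(arr, arr, arr, arr))    # same array reused for open/high/low/close, just to show it runs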
I'd like to retrieve the number of users belonging to some Windows UserGroup.
From the documentation of the Python API :
win32net.NetLocalGroupGetMembers(server, group, level)
I understand that according to the level param, I'll get differently detailed data, corresponding to Windows LOCALGROUP_MEMBERS_INFO_0, LOCALGROUP_MEMBERS_INFO_1, LOCALGROUP_MEMBERS_INFO_2 or LOCALGROUP_MEMBERS_INFO_3 structures.
Thus, if 93 users belong to the specified userGroup, I expect to always get 93 objects/structures of one of those types.
But my results are quite different. Here's what I get:
>>> import win32net
>>> import win32api
>>> server = "\\\\" + win32api.GetComputerName()
>>> users = []
>>> group = u"MyGroup"
>>> (users, total, res) = win32net.NetLocalGroupGetMembers(server, group, 0)
>>> len(users)
93
>>> (users, total, res) = win32net.NetLocalGroupGetMembers(server, group, 1)
>>> len(users)
56
>>> (users, total, res) = win32net.NetLocalGroupGetMembers(server, group, 2)
>>> len(users)
39
>>> (users, total, res) = win32net.NetLocalGroupGetMembers(server, group, 3)
>>> len(users)
68
I expect to get 93 users. And then I want the 93 usernames.
The username is accessible when specifying level=1, and with that param only 56 entries are returned.
Any clue? Thanks.
The call returns different numbers of results because each level returns a different amount of data per entry, so the default buffer holds fewer entries at the more detailed levels.
You can use the returned resume handle to continue fetching the rest, or increase the buffer size to get all results in one call.
Here's the full parameter list from the pywin32 help file:
NetLocalGroupGetMembers(server, groupName, level, resumeHandle, prefLen)
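A minimal sketch of the "bigger buffer" route (the group name is a placeholder): passing win32netcon.MAX_PREFERRED_LENGTH as prefLen lets the API size the buffer itself, although checking the resume handle, as in the loop further down, is still the robust approach for very large groups.

import win32api
import win32net
import win32netcon

server = "\\\\" + win32api.GetComputerName()
# MAX_PREFERRED_LENGTH asks the API to allocate whatever buffer it needs,
# so all members usually come back in a single call at any info level.
users, total, resume = win32net.NetLocalGroupGetMembers(
    server, u"MyGroup", 1, 0, win32netcon.MAX_PREFERRED_LENGTH)
print(len(users), total)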
Thanks for your help.
Here's the result :-)
import win32net
import win32api
import win32netcon
server = "\\\\" + win32api.GetComputerName()
users = []
result = []
group = "MyGroup"
handle = 0
level = 1
while True:
    (users, total, handle2) = win32net.NetLocalGroupGetMembers(server, group,
                                  level, handle, win32netcon.MAX_PREFERRED_LENGTH)
    for u in users:
        result.append(u)
    if handle2 == 0:
        break
    else:
        handle = handle2

print len(result)
In addition to @Sun Wikong's answer, I made a pip package that gets user membership, among a lot of other functions.
Install with pip install windows_tools.users
Usage:
import windows_tools.users as users
# We use group SID instead of name so we get actual results regardless of system locale
# You can use well_known_sids() for translation, eg
# sid = well_known_sids(username='Administrators')
# or
# usernname = well_known_sids(sid='S-1-5-32-545')
members = users.get_local_group_members(group_sid='S-1-5-32-545')
for member in members:
    print(member)
You might also want to check if a user is local admin:
# if no user is given, the current one is used
is_admin = users.is_user_local_admin('myuser')
print(is_admin)
I have hourly logs like
user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined
I want to compress all the flat files down to one file. There are around 30 million users in the logs and I just want the latest user log for all the logs.
My end result is I want to have a log look like
user1:added comment
user2:log out
user3:joined
Now my first attempt on a small scale was to just do a dict like
log['user1'] = "added comment"
Will a dict of 30 million key/val pairs have a giant memory footprint? Or should I use something like SQLite to store them, and then just put the contents of the SQLite table back into a file?
If you intern() each log entry then you'll use only one string object for each distinct log entry, regardless of the number of times it shows up, lowering memory usage a lot.
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = 'f' + ('oo',)[0]
>>> a is b
False
>>> a = intern('foo')
>>> b = intern('f' + ('oo',)[0])
>>> a is b
True
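(On Python 3 the built-in moved to sys.intern; the same demo reads:)

import sys

a = sys.intern('foo')
b = sys.intern('f' + ('oo',)[0])
print(a is b)   # True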
You could also process the log lines in reverse -- then use a set to keep track of which users you've seen:
s = set()
# note, this piece is inefficient in that I'm reading all the lines
# into memory in order to reverse them... There are recipes out there
# for reading a file in reverse.
lines = open('log').readlines()
lines.reverse()
for line in lines:
    line = line.strip()
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)
The various dbm modules (dbm in Python 3, or anydbm, gdbm, dbhash, etc. in Python 2) let you create simple databases of key to value mappings. They are stored on the disk so there is no huge memory impact. And you can store them as logs if you wish to.
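A minimal sketch of that idea, assuming Python 3's dbm module and a hypothetical logs.txt in the user:action format from the question:

import dbm

# Keys and values are stored as bytes; str is encoded automatically.
# Later writes for the same user overwrite earlier ones, so the file
# ends up holding only the latest action per user, on disk.
with dbm.open('latest_actions', 'c') as db, open('logs.txt') as logs:
    for line in logs:
        user, action = line.rstrip('\n').split(':', 1)
        db[user] = action

with dbm.open('latest_actions') as db:
    for user in db.keys():
        print(user.decode() + ':' + db[user].decode())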
This sounds like the perfect kind of problem for a Map/Reduce solution. See:
http://en.wikipedia.org/wiki/MapReduce
Hadoop
for example.
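A hedged single-machine sketch of that decomposition (file names are placeholders, and it assumes the hourly files sort into chronological order by name): map each hourly file to a per-file "latest action" dict in parallel, then reduce by merging the dicts in order.

from multiprocessing import Pool

def map_file(path):
    """Map step: collapse one hourly file to {user: latest action in that file}."""
    latest = {}
    with open(path) as f:
        for line in f:
            user, action = line.rstrip('\n').split(':', 1)
            latest[user] = action          # later lines overwrite earlier ones
    return latest

def reduce_dicts(dicts):
    """Reduce step: merge per-file results; later (newer) files win."""
    merged = {}
    for d in dicts:                        # dicts arrive in chronological order
        merged.update(d)
    return merged

if __name__ == '__main__':
    hourly_logs = sorted(['log_2010121400', 'log_2010121401'])   # placeholders
    with Pool() as pool:
        combined = reduce_dicts(pool.map(map_file, hourly_logs))
    for user in sorted(combined):
        print(user + ':' + combined[user])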
It's pretty easy to mock up the data structure to see how much memory it would take.
Something like this, where you could change gen_string to generate data that approximates the messages:
import random
from commands import getstatusoutput as gso
def gen_string():
    return str(random.random())

d = {}
for z in range(10**6):
    d[gen_string()] = gen_string()

print gso('ps -eo %mem,cmd |grep test.py')[1]
On a one gig netbook:
0.4 vim test.py
0.1 /bin/bash -c time python test.py
11.7 /usr/bin/python2.6 test.py
0.1 sh -c { ps -eo %mem,cmd |grep test.py; } 2>&1
0.0 grep test.py
real 0m26.325s
user 0m25.945s
sys 0m0.377s
... So it's using about 10% of 1 gig for 1,000,000 records
But it would also depend on how much data redundancy you have ...
Thanks to @Ignacio for intern() -
def procLog(logName, userDict):
    inf = open(logName, 'r')
    for ln in inf.readlines():
        name, act = ln.split(':')
        userDict[name] = intern(act)
    inf.close()
    return userDict

def doLogs(logNameList):
    userDict = {}
    for logName in logNameList:
        userDict = procLog(logName, userDict)
    return userDict

def writeOrderedLog(logName, userDict):
    keylist = userDict.keys()
    keylist.sort()
    outf = open(logName, 'w')
    for k in keylist:
        outf.write(k + ':' + userDict[k])
    outf.close()

def main():
    mylogs = ['log20101214', 'log20101215', 'log20101216']
    d = doLogs(mylogs)
    writeOrderedLog('cumulativeLog', d)
The question, then, is how much memory this will consume. To estimate that, generate fake user names and log states:
import random
import sys

def makeUserName():
    ch = random.choice
    syl = ['ba','ma','ta','pre','re','cu','pro','do','tru','ho','cre','su','si','du','so','tri','be','hy','cy','ny','quo','po']
    # 22**5 is about 5.1 million potential names
    return ch(syl).title() + ch(syl) + ch(syl) + ch(syl) + ch(syl)

ch = random.choice
states = ['joined', 'added pic', 'added article', 'added comment', 'voted', 'logged out']
d = {}
t = []
for i in xrange(1000):
    for j in xrange(8000):
        d[makeUserName()] = ch(states)
    t.append((len(d), sys.getsizeof(d)))
which results in a plot (horizontal axis = number of user names, vertical axis = memory usage in bytes) which is... slightly weird. It looks like a dictionary preallocates quite a lot of memory, then doubles it every time it gets too full?
Anyway, 4 million users takes just under 100MB of RAM - but the reallocation actually happens around 3 million users, at about 50MB, so if the doubling holds, you will need about 800MB of RAM to process 24 to 48 million users.
I'm trying to execute an R script from Python, ideally displaying and saving the results. Using rpy2 has been a bit of a struggle, so I thought I'd just call R directly. I have a feeling that I'll need to use something like "os.system" or "subprocess.call", but I am having difficulty deciphering the module guides.
Here's the R script "MantelScript", which uses a particular stat test to compare two distance matrices at a time (distmatA1 and distmatB1). This works in R, though I haven't yet put in the iterating bits in order to read through and compare a bunch of files in a pairwise fashion (I really need some assistance with this, too btw!):
library(ade4)
M1<-read.table("C:\\pythonscripts\\distmatA1.csv", header = FALSE, sep = ",")
M2<-read.table("C:\\pythonscripts\\distmatB1.csv", header = FALSE, sep = ",")
mantel.rtest(dist(matrix(M1, 14, 14)), dist(matrix(M2, 14, 14)), nrepet = 999)
Here's the relevant bit of my python script, which reads through some previously formulated lists and pulls out matrices in order to compare them via this Mantel Test (it should pull the first matrix from identityA and sequentially compare it to every matrix in identityB, then repeat with the second matrix from identityA, etc.). I want to save these files and then call on the R program to compare them:
# windownA and windownB are lists containing ascending sequences of integers
# identityA and identityB are lists where each field is a distance matrix.
z = 0
v = 0
import csv
import subprocess
import os

for i in windownA:
    M1 = identityA[i]
    z += 1
    filename = "C:/pythonscripts/distmatA" + str(z) + ".csv"
    file = csv.writer(open(filename, 'w'))
    file.writerow(M1)

    for j in windownB:
        M2 = identityB[j]
        v += 1
        filename2 = "C:/pythonscripts/distmatB" + str(v) + ".csv"
        file = csv.writer(open(filename2, 'w'))
        file.writerow(M2)

        ## result = os.system('R CMD BATCH C:/R/library/MantelScript.R') - maybe something like this??
        ## result = subprocess.call(['C:/R/library/MantelScript.txt']) - or maybe this??

        print result
        print ' '
If your R script only has side effects that's fine, but if you want to process the results further with Python, you'll still be better off using rpy2.
import rpy2.robjects
f = file("C:/R/library/MantelScript.R")
code = ''.join(f.readlines())
result = rpy2.robjects.r(code)
# assume that MantelScript creates a variable "X" in the R GlobalEnv workspace
X = rpy2.robjects.globalenv['X']
Stick with this.
process = subprocess.Popen(['R', 'CMD', 'BATCH', 'C:/R/library/MantelScript.R'])
process.wait()
When the wait() function returns, the .R script is finished.
Note that you should write your .R script to produce a file that your Python program can read.
with open('the_output_from_mantelscript', 'r') as result:
    for line in result:
        print(line)
Don't waste a lot of time trying to hook up a pipeline.
Invest time in getting a basic "Python spawns R" process working.
You can add to this later.
In case you're interested in generally invoking an R subprocess from Python:
#!/usr/bin/env python3
from io import StringIO
from subprocess import PIPE, Popen
def rnorm(n):
    rscript = Popen(["Rscript", "-"], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    with StringIO() as s:
        s.write("x <- rnorm({})\n".format(n))
        s.write("cat(x, \"\\n\")\n")
        return rscript.communicate(s.getvalue().encode())

if __name__ == '__main__':
    output, errmsg = rnorm(5)
    print("stdout:")
    print(output.decode('utf-8').strip())
    print("stderr:")
    print(errmsg.decode('utf-8').strip())
Better to do it through Rscript.
Given what you're trying to do, a pure R solution might be neater:
file.pairs <- combn(dir(pattern="*.csv"), 2) # get every pair of csv files in the current dir
The pairs are columns in a 2xN matrix:
file.pairs[,1]
[1] "distmatrix1.csv" "distmatrix2.csv"
You can run a function on these columns by using apply (with option '2', meaning 'act over columns'):
my.func <- function(v) paste(v[1], v[2], sep="::")
apply(file.pairs, 2, my.func)
In this example my.func just glues the two file names together; you could replace this with a function that does the Mantel Test, something like (untested):
my.func <- function(v){
    M1 <- read.table(v[1], header = FALSE, sep = ",")
    M2 <- read.table(v[2], header = FALSE, sep = ",")
    mantel.rtest(dist(matrix(M1, 14, 14)), dist(matrix(M2, 14, 14)), nrepet = 999)
}