I have hourly logs like
user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined
I want to compress all the flat files down to one file. There are around 30 million users in the logs, and I just want the latest log entry for each user.
My end result is I want to have a log look like
user1:added comment
user2:log out
user3:joined
Now my first attempt on a small scale was to just do a dict like
log['user1'] = "added comment"
Will a dict of 30 million key/value pairs have a giant memory footprint? Or should I use something like SQLite to store them, then just write the contents of the SQLite table back into a file?
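The SQLite route mentioned in the question might look something like this (a sketch with the sample lines inlined; the table name and the use of an in-memory database are my own choices):

```python
import sqlite3

# Sample lines in the hourly log format from the question.
log_lines = [
    'user1:joined',
    'user2:log out',
    'user1:added pic',
    'user1:added comment',
    'user3:joined',
]

# One row per user: INSERT OR REPLACE keeps only the latest action seen,
# so the table never grows past one entry per user.
conn = sqlite3.connect(':memory:')  # a real file path would keep it on disk
conn.execute('CREATE TABLE latest (user TEXT PRIMARY KEY, action TEXT)')
for line in log_lines:
    user, _, action = line.partition(':')
    conn.execute('INSERT OR REPLACE INTO latest VALUES (?, ?)', (user, action))

result = ['%s:%s' % row for row in
          conn.execute('SELECT user, action FROM latest ORDER BY user')]
print('\n'.join(result))
```

With the sample data this prints the three-line result shown in the question.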
If you intern() each log entry, you'll use only one string object per distinct entry no matter how many times it shows up, thereby lowering memory usage a lot.
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = 'f' + ('oo',)[0]
>>> a is b
False
>>> a = intern('foo')
>>> b = intern('f' + ('oo',)[0])
>>> a is b
True
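One caveat if you are on Python 3: intern() moved from the builtins to sys.intern. A version-agnostic sketch of the dict-with-interned-values idea:

```python
try:
    from sys import intern  # Python 3: intern() is no longer a builtin
except ImportError:
    pass                    # Python 2: the builtin intern() already exists

log = {}
for line in ['user1:joined', 'user1:added pic', 'user1:joined']:
    user, _, action = line.partition(':')
    log[user] = intern(action)  # repeated actions all share one string object

print(log)
```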
You could also process the log lines in reverse -- then use a set to keep track of which users you've seen:
s = set()
# note, this piece is inefficient in that I'm reading all the lines
# into memory in order to reverse them... There are recipes out there
# for reading a file in reverse.
lines = open('log').readlines()
lines.reverse()
for line in lines:
    line = line.strip()
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)
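For completeness, one such recipe might look like this (a sketch of my own, not a standard-library function: read fixed-size blocks from the end of the file and carry each block's partial first line over to the next block):

```python
import os

def reverse_lines(path, block_size=4096):
    """Yield the lines of a text file last-to-first without reading it whole."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b''                      # partial line carried between blocks
        while pos > 0:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            block = f.read(step) + tail
            lines = block.split(b'\n')
            tail = lines.pop(0)         # may be incomplete; finish it next block
            for line in reversed(lines):
                if line:
                    yield line.decode()
        if tail:
            yield tail.decode()
```

With the sample log above, this would yield user3's line first and user1's first-ever line last, using only one block of the file in memory at a time.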
The various dbm modules (dbm in Python 3, or anydbm, gdbm, dbhash, etc. in Python 2) let you create simple databases of key to value mappings. They are stored on the disk so there is no huge memory impact. And you can store them as logs if you wish to.
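A sketch of what that could look like with the Python 3 dbm module (the database file name is made up; note that values come back as bytes):

```python
import dbm  # Python 2 spelling: import anydbm as dbm

# The mapping lives on disk, so 30 million entries won't strain RAM;
# assigning to an existing key simply overwrites the earlier action.
db = dbm.open('latest_action', 'c')  # hypothetical database file name
for line in ['user1:joined', 'user2:log out', 'user1:added comment']:
    user, _, action = line.partition(':')
    db[user] = action
print(db['user1'])  # b'added comment'
db.close()
```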
This sounds like the perfect kind of problem for a Map/Reduce solution. See, for example:
http://en.wikipedia.org/wiki/MapReduce
Hadoop
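The shape of such a job, sketched in plain Python (a real cluster would distribute the map and reduce phases across machines; pairing each record with its line position is my own way of defining "latest"):

```python
from itertools import groupby
from operator import itemgetter

lines = ['user1:joined', 'user2:log out', 'user1:added pic',
         'user1:added comment', 'user3:joined']

# Map: emit (user, (position, action)) so the reducer can tell which is latest.
mapped = []
for pos, line in enumerate(lines):
    user, _, action = line.partition(':')
    mapped.append((user, (pos, action)))

# Shuffle: group the pairs by user (a real framework does this between phases).
mapped.sort(key=itemgetter(0))

# Reduce: for each user, keep the action with the highest position.
latest = {user: max(pair for _, pair in pairs)[1]
          for user, pairs in groupby(mapped, key=itemgetter(0))}

print(sorted(latest.items()))
```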
It's pretty easy to mock up the data structure to see how much memory it would take.
Something like this, where you could change gen_string to generate data that approximates the messages:
import random
from commands import getstatusoutput as gso

def gen_string():
    return str(random.random())

d = {}
for z in range(10**6):
    d[gen_string()] = gen_string()

print gso('ps -eo %mem,cmd |grep test.py')[1]
On a one gig netbook:
0.4 vim test.py
0.1 /bin/bash -c time python test.py
11.7 /usr/bin/python2.6 test.py
0.1 sh -c { ps -eo %mem,cmd |grep test.py; } 2>&1
0.0 grep test.py
real 0m26.325s
user 0m25.945s
sys 0m0.377s
... So it's using about 10% of 1 gig for 1,000,000 records
But it would also depend on how much data redundancy you have ...
Thanks to @Ignacio for intern() -
def procLog(logName, userDict):
    inf = open(logName, 'r')
    for ln in inf.readlines():
        name, act = ln.split(':')
        userDict[name] = intern(act)
    inf.close()
    return userDict

def doLogs(logNameList):
    userDict = {}
    for logName in logNameList:
        userDict = procLog(logName, userDict)
    return userDict

def writeOrderedLog(logName, userDict):
    keylist = userDict.keys()
    keylist.sort()
    outf = open(logName, 'w')
    for k in keylist:
        outf.write(k + ':' + userDict[k])
    outf.close()

def main():
    mylogs = ['log20101214', 'log20101215', 'log20101216']
    d = doLogs(mylogs)
    writeOrderedLog('cumulativeLog', d)
The question, then, is how much memory this will consume.
import random
import sys

def makeUserName():
    ch = random.choice
    syl = ['ba','ma','ta','pre','re','cu','pro','do','tru','ho','cre','su','si','du','so','tri','be','hy','cy','ny','quo','po']
    # 22**5 is about 5.1 million potential names
    return ch(syl).title() + ch(syl) + ch(syl) + ch(syl) + ch(syl)

ch = random.choice
states = ['joined', 'added pic', 'added article', 'added comment', 'voted', 'logged out']
d = {}
t = []
for i in xrange(1000):
    for j in xrange(8000):
        d[makeUserName()] = ch(states)
    t.append( (len(d), sys.getsizeof(d)) )
which results in the following chart (horizontal axis = number of user names, vertical axis = memory usage in bytes), which is... slightly weird. It looks like a dictionary preallocates quite a lot of memory, then doubles it every time it gets too full?
Anyway, 4 million users takes just under 100MB of RAM, but it actually reallocates at around 3 million users (50MB), so if the doubling holds, you will need about 800MB of RAM to process 24 to 48 million users.
Related
I am writing code for an information retrieval project which reads Wikipedia pages in XML format from a file, processes the string (I've omitted this part for the sake of simplicity), tokenizes the strings and builds positional indexes for the terms found on the pages. It saves the indexes to a file using pickle once, and then reads them back from that file on subsequent runs to save processing time (I've included the code for those parts, but they're commented out).
After that, I need to fill a 1572 * ~97000 matrix (1572 is the number of Wiki pages, and 97000 is the number of terms found in them). Each Wiki page is a vector of words, and vectors[i][j] is the number of occurrences of the j'th word of the word set in the i'th Wiki page. (Again, it's been simplified, but it doesn't matter.)
The problem is that the code takes way too much memory to run, and even then, at a point somewhere between the 350th and 400th row of the matrix, it stops making progress (without terminating). I thought the problem was memory, because when its usage exceeded my 7.7GiB RAM and 1.7GiB swap, it stopped and printed:
import pickle
from hazm import *
from collections import defaultdict
import numpy as np
'''For each word there's one of these. it stores the word's frequency, and the positions it has occurred in on each wiki page'''
class Positional_Index:
    def __init__(self):
        self.freq = 0
        self.title = defaultdict(list)
        self.text = defaultdict(list)
'''Here I tokenize words and construct indexes for them'''
# tree = ET.parse('Wiki.xml')
# root = tree.getroot()
# index_dict = defaultdict(Positional_Index)
# all_content = root.findall('{http://www.mediawiki.org/xml/export-0.10/}page')
#
# for page_index, pg in enumerate(all_content):
#     title = pg.find('{http://www.mediawiki.org/xml/export-0.10/}title').text
#     txt = pg.find('{http://www.mediawiki.org/xml/export-0.10/}revision') \
#         .find('{http://www.mediawiki.org/xml/export-0.10/}text').text
#
#     title_arr = word_tokenize(title)
#     txt_arr = word_tokenize(txt)
#
#     for term_index, term in enumerate(title_arr):
#         index_dict[term].freq += 1
#         index_dict[term].title[page_index] += [term_index]
#
#     for term_index, term in enumerate(txt_arr):
#         index_dict[term].freq += 1
#         index_dict[term].text[page_index] += [term_index]
#
# with open('texts/indices.txt', 'wb') as f:
#     pickle.dump(index_dict, f)
with open('texts/indices.txt', 'rb') as file:
    data = pickle.load(file)
'''Here I'm trying to keep the number of occurrences of each word on each page'''
page_count = 1572
vectors = np.array([[0 for j in range(len(data.keys()))] for i in range(page_count)], dtype=np.float16)
words = list(data.keys())
word_count = len(words)
const_log_of_d = np.log10(1572)
""" :( """
for i in range(page_count):
    for j in range(word_count):
        vectors[i][j] = (len(data[words[j]].title[i]) + len(data[words[j]].text[i]))
    if i % 50 == 0:
        print("i:", i)
Update: I tried this on a friend's computer; this time it killed the process somewhere between the 1350th and 1400th iteration.
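One observation about the allocation above, for scale: the nested list comprehension passed to np.array first builds roughly 1572 * 97,000, around 150 million, temporary Python int objects before NumPy copies them out, while np.zeros allocates the 2-bytes-per-element float16 buffer directly (a sketch, not from any posted answer):

```python
import numpy as np

page_count, word_count = 1572, 97000

# Allocate the float16 buffer directly instead of via a list of lists.
vectors = np.zeros((page_count, word_count), dtype=np.float16)

# 2 bytes per element: about 0.28 GiB for the whole matrix.
print(vectors.nbytes)  # 304968000
```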
Is there a limit to memory for Python? I've been using a Python script to calculate average values from a file which is a minimum of 150MB big.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to Python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20GB); the minimum size of a file is 150MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)
    count = 0
    for j in list_of_lines:
        count += 1
    for k in range(0, count):
        list_of_lines[k].remove('\n')
    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0, length):
        total = 0
        for p in range(0, count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second; hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications over the years to improve its implementation, most of them not too major. This way, if folks use it as a template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError:    # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print(' {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')
    file_write.write('\n')

file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()), which will of course fail if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better to iterate over each line:
for current_line in u:
    do_something_with(current_line)
This is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, since it seems that you're processing TSV files (tab-separated values), you should take a look at the csv module, which will handle all the splitting, removing of \ns etc. for you.
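For instance, a reader for the tab-separated files might look like this (a sketch; the sample file name and contents are made up):

```python
import csv

# Make a small tab-separated sample file to read back.
with open('sample.txt', 'w') as f:
    f.write('1.0\t2.0\n3.0\t4.0\n')

# csv.reader splits each row on tabs and strips the trailing newline for you,
# so there is no need for split('\t') or values.remove('\n').
with open('sample.txt') as f:
    for row in csv.reader(f, delimiter='\t'):
        print([float(v) for v in row])
```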
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On jython 2.5 it crashes earlier:
239000 [MiB]
I can probably configure Jython to use more memory (it uses the limits from the JVM).
Test app:
import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read it line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but you also laboriously replicate that information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")  # not used

files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n')  # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n')  # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1
I have a small program, written for me in Python, that helps me generate all combinations of passwords from different sets of numbers and words I know, so I can recover a password I forgot. Since I know all the different words and sets of numbers I used, I just want to generate all possible combinations. The only problem is that the list seems to go on for hours and hours, so eventually I run out of memory and it doesn't finish.
I was told it needs to dump my memory so it can carry on, but I'm not sure that's right. Is there any way I can get around this problem?
This is the program I am running:
#!/usr/bin/python
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = "££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
Characterset2, daughternam, daughtyear, phonenum1, phonenum2]
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        print "".join(item)
I have taken out a few sets and changed the numbers and words for obvious reasons, but this is roughly the program.
Another thing: I may be missing a particular word, but I didn't want to put it in the list because I know it would go before all the generated passwords. Does anyone know how to add a prefix to my program?
Sorry for the bad grammar, and thanks for any help given.
I used guppy to understand the memory usage. I changed the OP's code slightly (marked # !!!):
import itertools
gfname = "name"
tendig = "1234567890"
sixteendig = "1111111111111111"
housenum = "99"
Characterset1 = "&&&&"
Characterset2 = u"££££"
daughternam = "dname"
daughtyear = "1900"
phonenum1 = "055522233"
phonenum2 = "3333333"
from guppy import hpy # !!!
h=hpy() # !!!
mylist = [gfname, tendig, sixteendig, housenum, Characterset1,
Characterset2, daughternam, daughtyear, phonenum1, phonenum2]
for length in range(1, len(mylist)+1):
    print h.heap()  # !!!
    for item in itertools.permutations(mylist, length):
        print item  # !!!
Guppy outputs something like this every time h.heap() is called.
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 11748 45 985544 29 985544 29 str
1 5858 23 472376 14 1457920 43 tuple
2 323 1 253640 8 1711560 51 dict (no owner)
3 67 0 213064 6 1924624 57 dict of module
4 199 1 210856 6 2135480 63 dict of type
5 1630 6 208640 6 2344120 70 types.CodeType
6 1593 6 191160 6 2535280 75 function
7 199 1 177008 5 2712288 80 type
8 124 0 135328 4 2847616 84 dict of class
9 1045 4 83600 2 2931216 87 __builtin__.wrapper_descriptor
Running python code.py > code.log and then fgrep Partition code.log shows:
Partition of a set of 25914 objects. Total size = 3370200 bytes.
Partition of a set of 25924 objects. Total size = 3355832 bytes.
Partition of a set of 25924 objects. Total size = 3355728 bytes.
Partition of a set of 25924 objects. Total size = 3372568 bytes.
Partition of a set of 25924 objects. Total size = 3372736 bytes.
Partition of a set of 25924 objects. Total size = 3355752 bytes.
Partition of a set of 25924 objects. Total size = 3372592 bytes.
Partition of a set of 25924 objects. Total size = 3372760 bytes.
Partition of a set of 25924 objects. Total size = 3355776 bytes.
Partition of a set of 25924 objects. Total size = 3372616 bytes.
Which I believe shows that the memory footprint stays fairly consistent.
Granted, I may be misinterpreting the results from guppy, although during my tests I deliberately added a new string to a list to see if the object count increased, and it did.
For those interested, I had to install guppy like so on OS X Mountain Lion:
pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy
In summary, I don't think it's a running-out-of-memory issue, although granted we're not using the full OP dataset.
How about using IronPython and Visual Studio for its debug tools (which are pretty good)? You should be able to pause execution and look at the memory (essentially a memory dump).
Your program will run pretty efficiently by itself, as you now know. But make sure you don't just run it in IDLE, for example; that will slow it down to a crawl as IDLE updates the screen with more and more lines. Save the output directly into a file.
Even better: Have you thought about what you'll do when you have the passwords? If you can log on to the lost account from the command line, try doing that immediately instead of storing all the passwords for later use:
for length in range(1, len(mylist)+1):
    for item in itertools.permutations(mylist, length):
        password = "".join(item)
        try_to_logon(command, password)
To answer the above comment from @shaun: if you want the output to go to a text file, just run your file like so:
Myfile.py >output.txt
If the text file doesn't exist it will be created.
EDIT:
Replace the line in your code at the bottom which reads:
print "".join(item)
with this:
with open("output.txt", "a") as f:
    f.write("".join(item) + '\n')
which will append each generated password to a file called output.txt.
Should work (haven't tested)
Trying to load a file into python. It's a very big file (1.5Gb), but I have the available memory and I just want to do this once (hence the use of python, I just need to sort the file one time so python was an easy choice).
My issue is that loading this file results in way too much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700MB, which is clearly too much. At around 50% the script hangs, using 3.03GB of real memory (and slowly rising).
I know this isn't the most efficient method of sorting a file (memory-wise) but I just want it to work so I can move on to more important problems :D So, what is wrong with the following python code that's causing the massive memory usage:
print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline()  # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0
for line in input_file:
    currentLine += 1.0
    lined = line.split('\t')
    printEvery100000 += 1
    if printEvery100000 == 100000:
        print str(currentLine / totalLines)
        printEvery100000 = 0
    lines.append( (lined[timestamp_pos].strip(), lined[personID_pos].strip(), lined[x_pos].strip(), lined[y_pos].strip()) )
input_file.close()
print 'Done loading file into memory'
EDIT: In case anyone is unsure, the general consensus seems to be that each variable allocated eats up more and more memory. I "fixed" it in this case by 1) calling readlines(), which still loads all the data, but only has one string-variable's worth of overhead for each line. This loads the entire file using about 1.7GB. Then, when I call lines.sort(), I pass a function to key that splits on tabs and returns the right column value, converted to an int. This is slow computationally, and memory-intensive overall, but it works. Learned a ton about variable allocation overhead today :D
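The sort-with-key idea from the edit, sketched on toy data (the column index and the data are my own assumptions, not from the original script):

```python
# Keep each line as a single string; sort() extracts the key on the fly
# instead of storing a pre-split tuple per line.
lines = ['b\t30\n', 'a\t10\n', 'c\t20\n']  # stand-in for readlines() output

SORT_COL = 1  # hypothetical column holding the integer sort key
lines.sort(key=lambda ln: int(ln.split('\t')[SORT_COL]))
print(lines)  # ['a\t10\n', 'c\t20\n', 'b\t30\n']
```

Note that int() tolerates the trailing newline on the last column, so no stripping is needed.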
Here is a rough estimate of the memory needed, based on the constants derived from your example. At a minimum you have to figure the Python internal object overhead for each split line, plus the overhead for each string.
It estimates 9.1 GB to store the file in memory, assuming the following constants, which are off by a bit, since you're only using part of each line:
1.5 GB file size
31,164,015 total lines
each line split into a list with 4 pieces
Code:
import sys

def sizeof(lst):
    return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))

val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols
line_size = sizeof(lst)

print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))
Returns:
avg line size: 312 bytes
approx. memory needed: 9.1 GB
I don't know about the analysis of the memory usage, but you might try this to get it to work without running out of memory. You'll sort into a new file which is accessed using a memory mapping (I've been led to believe this will work efficiently in terms of memory). Mmap has some OS-specific workings; I tested this on Linux (very small scale).
This is the basic code, to make it run with a decent time efficiency you'd probably want to do a binary search on the sorted file to find where to insert the line otherwise it will probably take a long time.
You can find a file-seeking binary search algorithm in this question.
Hopefully a memory efficient way of sorting a massive file by line:
import os
from mmap import mmap

input_file = open('unsorted.txt', 'r')
output_file = open('sorted.txt', 'w+')

# need to provide something in order to be able to mmap the file
# so we'll just copy the first line over
output_file.write(input_file.readline())
output_file.flush()
mm = mmap(output_file.fileno(), os.stat(output_file.name).st_size)
cur_size = mm.size()

for line in input_file:
    mm.seek(0)
    tup = line.split("\t")
    while True:
        cur_loc = mm.tell()
        o_line = mm.readline()
        o_tup = o_line.split("\t")
        if o_line == '' or tup[0] < o_tup[0]:  # EOF or we found our spot
            mm.resize(cur_size + len(line))
            mm[cur_loc+len(line):] = mm[cur_loc:cur_size]
            mm[cur_loc:cur_loc+len(line)] = line
            cur_size += len(line)
            break
I'm attempting to code a script that outputs each user and their group, one per line, like so:
user1 group1
user2 group1
user3 group2
...
user10 group6
etc.
I'm writing a script in Python for this, but was wondering how SO would do it.
p.s. Take a whack at it in any language but I'd prefer python.
EDIT: I'm working on Linux. Ubuntu 8.10 or CentOS =)
For *nix, you have the pwd and grp modules. You iterate through pwd.getpwall() to get all users, and look up their group names with grp.getgrgid(gid).
import pwd, grp
for p in pwd.getpwall():
    print p[0], grp.getgrgid(p[3])[0]
The grp module is your friend. Look at grp.getgrall() to get a list of all groups and their members.
EDIT example:
import grp
groups = grp.getgrall()
for group in groups:
    for user in group[3]:
        print user, group[0]
sh/bash:
getent passwd | cut -f1 -d: | while read name; do echo -n "$name " ; groups $name ; done
The Python call grp.getgrall() only shows local groups. Unlike the getgrouplist C function, it does not return all of a user's groups; for example, it misses users in sssd backed by an LDAP server with enumeration turned off (as in FreeIPA).
After searching for the easiest way to get all the groups a user belongs to in Python, the best way I found was to actually call the getgrouplist C function:
#!/usr/bin/python

import grp, pwd, os
from ctypes import *
from ctypes.util import find_library

libc = cdll.LoadLibrary(find_library('libc'))
getgrouplist = libc.getgrouplist

# 50 groups should be enough?
ngroups = 50
getgrouplist.argtypes = [c_char_p, c_uint, POINTER(c_uint * ngroups), POINTER(c_int)]
getgrouplist.restype = c_int32

grouplist = (c_uint * ngroups)()
ngrouplist = c_int(ngroups)

user = pwd.getpwuid(2540485)

ct = getgrouplist(user.pw_name, user.pw_gid, byref(grouplist), byref(ngrouplist))

# if 50 groups was not enough this will be -1, try again
# luckily the last call put the correct number of groups in ngrouplist
if ct < 0:
    getgrouplist.argtypes = [c_char_p, c_uint, POINTER(c_uint * int(ngrouplist.value)), POINTER(c_int)]
    grouplist = (c_uint * int(ngrouplist.value))()
    ct = getgrouplist(user.pw_name, user.pw_gid, byref(grouplist), byref(ngrouplist))

for i in xrange(0, ct):
    gid = grouplist[i]
    print grp.getgrgid(gid).gr_name
Getting a list of all users to run this function on would similarly require figuring out what C call is made by getent passwd and calling that from Python.
Here is a simple function capable of dealing with the structure of either of these files (/etc/passwd and /etc/group).
I believe this code meets your needs, using Python built-in functions and no additional modules:
#!/usr/bin/python

def read_and_parse(filename):
    """
    Reads and parses lines from /etc/passwd and /etc/group.

    Parameters
    ----------
    filename : str
        Full path for filename.
    """
    data = []
    with open(filename, "r") as f:
        for line in f.readlines():
            data.append(line.split(":")[0])
    data.sort()
    for item in data:
        print("- " + item)
read_and_parse("/etc/group")
read_and_parse("/etc/passwd")