process large text file in python

I have a very large file (3.8G) that is an extract of users from a system at my school. I need to reprocess that file so that it just contains their ID and email address, comma separated.
I have very little experience with this and would like to use it as a learning exercise for Python.
The file has entries that look like this:
dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone#system.edu
dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble#system.edu
I am trying to get a file that looks like:
0099886,fflintstone#system.edu
0083156,brubble#system.edu
Any tips or code?

That actually looks like an LDIF file to me. The python-ldap library has a pure-Python LDIF handling library that could help if your file possesses some of the nasty gotchas possible in LDIF, e.g. Base64-encoded values, entry folding, etc.
You could use it like so:
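import csv
import ldif
def first(values):
    # Attribute values arrive as a list; python-ldap 3.x yields bytes
    value = values[0]
    return value.decode('utf-8') if isinstance(value, bytes) else value
class ParseRecords(ldif.LDIFParser):
    def __init__(self, input_file, csv_writer):
        # Let the base class set up reading from the input file
        ldif.LDIFParser.__init__(self, input_file)
        self.csv_writer = csv_writer
    def handle(self, dn, entry):
        # Called once per record parsed from the file
        self.csv_writer.writerow([first(entry['LoginId']), first(entry['mail'])])
# Python 3: open the CSV output in text mode with newline=''
with open('/path/to/large_file') as input_file, open('output_file', 'w', newline='') as output:
    csv_writer = csv.writer(output)
    csv_writer.writerow(['LoginId', 'Mail'])
    ParseRecords(input_file, csv_writer).parse()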
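To make the "gotchas" mentioned above concrete, here is a minimal standard-library sketch of two of them: folded lines (a continuation line starts with a single space) and Base64 values (marked by a double colon). The helper names unfold and parse_attribute are made up for illustration and cover only these two cases, not the full LDIF format; for real data the python-ldap parser is the safer choice.

```python
import base64

def unfold(lines):
    """Join LDIF continuation lines (starting with one space) to their parent line."""
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(" ") and current is not None:
            current += line[1:]  # drop the single leading space
        else:
            if current is not None:
                yield current
            current = line
    if current is not None:
        yield current

def parse_attribute(line):
    """Split 'attr: value' or 'attr:: base64value' into (attr, decoded value)."""
    attr, _, rest = line.partition(":")
    if rest.startswith(":"):  # '::' marks a Base64-encoded value
        return attr, base64.b64decode(rest[1:].strip()).decode("utf-8")
    return attr, rest.strip()
```

Feeding the unfolded lines through parse_attribute gives plain (name, value) pairs that can then be filtered for LoginId and mail.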
Edit
So to extract from a live LDAP directory, using the python-ldap library you would want to do something like this:
import csv
import ldap
con = ldap.initialize('ldap://server.fqdn.system.edu')
# if your LDAP directory requires authentication
# con.bind_s(username, password)
try:
    with open('output_file', 'w', newline='') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])
        results = con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE, attrlist=['LoginId', 'mail'])
        for dn, attrs in results:
            # each attribute maps to a list of values (bytes in python-ldap 3.x); take the first
            csv_writer.writerow([attrs['LoginId'][0].decode('utf-8'), attrs['mail'][0].decode('utf-8')])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()
It's probably worthwhile reading through the documentation for the ldap module, especially the example.
Note that in the example above, I completely skipped supplying a filter, which you would probably want to do in production. A filter in LDAP is similar to the WHERE clause in a SQL statement; it restricts what objects are returned. Microsoft actually has a good guide on LDAP filters. The canonical reference for LDAP filters is RFC 4515.
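As a rough illustration of what such a filter looks like, the sketch below builds an RFC 4515 filter string with plain string handling. The function names are hypothetical, and the choice of "has a mail attribute, optionally a specific LoginId" is only an assumed example of a useful restriction. python-ldap also provides ldap.filter.escape_filter_chars for the escaping step.

```python
def escape_filter_value(value):
    """Escape the characters RFC 4515 reserves inside a filter assertion value."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)

def students_with_mail_filter(login_id=None):
    """Build a filter for entries that have a mail attribute, optionally one LoginId."""
    clauses = ["(mail=*)"]  # presence test: the attribute must exist
    if login_id is not None:
        clauses.append("(LoginId={})".format(escape_filter_value(login_id)))
    # '&' requires every clause to match, like AND in a SQL WHERE clause
    return "(&{})".format("".join(clauses))
```

The resulting string would be passed as the third argument (filterstr) of search_s, between the scope and attrlist.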
Similarly, if there are potentially several thousand entries even after applying an appropriate filter, you may need to look into the LDAP paging control, though using that would, again, make the example more complex. Hopefully that's enough to get you started, but if anything comes up, feel free to ask or open a new question.
Good luck.

Assuming that the structure of each entry will always be the same, just do something like this:
import csv
# Open the input file and create the output file; both are closed when the block ends
with open("/path/to/large.file", "r") as f, \
        open("/desired/path/to/final/file", "w", newline="") as output_file:
    # Use the CSV module to make use of existing functionality.
    final_file = csv.writer(output_file)
    # Write the header row - can be skipped if headers not needed.
    final_file.writerow(["LoginID","EmailAddress"])
    # Set up our temporary cache for a user
    current_user = []
    # Iterate over the large file
    # Note that we are avoiding loading the entire file into memory
    for line in f:
        # The attribute in the sample data is spelled "LoginId"
        if line.startswith("LoginId"):
            current_user.append(line[9:].strip())
        # If more information is desired, simply add it to the conditions here
        # (additional elif's should do)
        # and add it to the current user.
        elif line.startswith("mail"):
            current_user.append(line[6:].strip())
            # Once you know you have reached the end of a user entry
            # write the row to the final file
            # and clear your temporary list.
            final_file.writerow(current_user)
            current_user = []
        # Skip lines that aren't interesting.
        else:
            continue

Again assuming your file is well-formed:
with open(inputfilename) as inputfile, open(outputfilename, 'w') as outputfile:
    mail = loginid = ''
    for line in inputfile:
        # split on the first ':' only, so values containing ':' stay intact
        line = line.split(':', 1)
        if line[0] not in ('LoginId', 'mail'):
            continue
        if line[0] == 'LoginId':
            loginid = line[1].strip()
        if line[0] == 'mail':
            mail = line[1].strip()
        if mail and loginid:
            outputfile.write(loginid + ',' + mail + '\n')
            mail = loginid = ''
Essentially equivalent to the other methods.
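For comparison, the line-by-line approaches above can be sketched as a generator plus a small driver, which keeps memory use flat regardless of file size. This assumes the well-formed "key: value" layout shown in the question; the names iter_users and convert are hypothetical. Unlike the versions above, it resets its state on each dn: line so that an entry missing one attribute does not leak into the next.

```python
import csv

def iter_users(lines):
    """Yield (login_id, mail) pairs from an iterable of 'key: value' lines."""
    login_id = mail = None
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        if key == "dn":
            login_id = mail = None  # a new entry starts; drop partial state
        elif key == "LoginId":
            login_id = value.strip()
        elif key == "mail":
            mail = value.strip()
        if login_id and mail:
            yield login_id, mail
            login_id = mail = None

def convert(input_path, output_path):
    """Stream the extract at input_path into a two-column CSV at output_path."""
    with open(input_path) as src, open(output_path, "w", newline="") as dst:
        csv.writer(dst).writerows(iter_users(src))
```

Because iter_users accepts any iterable of lines, it can be tried on a short list of strings before pointing convert at the 3.8G file.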

To open the file you'll want to use something like the with keyword to ensure it closes properly even if something goes wrong:
with open(<your_file>, "r") as f:
    # Do stuff
As for actually parsing out that information, I'd recommend building a dictionary of ID email pairs. You'll also need a variable for the uid and the email.
data = {}
uid = 0
email = ""
To actually parse through the file (the stuff run while your file is open) you can do something like this:
for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email
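Using the CSV writer (remember to import csv at the beginning of the file) we can output like this:
with open(<filename>, "w", newline="") as out:
    # csv.writer needs an open file object, not a file name
    writer = csv.writer(out)
    # writerow takes a sequence of fields, not a pre-joined string
    writer.writerow(["User", "Email"])
    for id, mail in data.items():
        writer.writerow([id, mail])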
Another option is to open the writer before the file, write the header, then read the lines from the file at the same time as writing to the CSV. This avoids dumping the information into memory, which might be highly desirable. So putting it all together we get
with open(<filename>, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["User", "Email"])
    with open(<your_file>, "r") as f:
        for line in f:
            if "uid=" in line:
                # Parse the user id out by grabbing the substring between the first = and ,
                uid = line[line.find("=")+1:line.find(",")]
            elif "mail:" in line:
                # Parse the email out by grabbing everything from the : to the end (removing the newline character)
                email = line[line.find(": ")+2:-1]
                # Given the formatting you've provided, this comes second so we can write the row here
                writer.writerow([uid, email])

Related

Read lines from file and store as variable to use in another function before repeating

I'm fairly new to Python and I'm currently stuck trying to improve my script. I have a script that performs a lot of operations using selenium to automate a manual task. The script opens two pages, searches for an email, fetches data from that page and sends it to another tab. I need help feeding the script a text file containing a list of email addresses, one line at a time, and using each line to search the webpage. What I need is the following:
1. Open file "test.txt"
2. Read the first line in the text file and store this value for use in another function.
3. Perform the function which uses the line from the text file as its input value.
4. Add "Completed" behind the first line in the text file before moving to the next.
5. Move to and read the next line in the text file, store it as a variable and repeat from step 3.
I'm not sure how I can do this.
Here is a snippet of my code at the time:
def fetchEmail():
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt" #to make sure a .txt extension is used
    line = f.readline()
    for line in f:
        print(line) # <-- How can I store the value here for use later?
        break
def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    emailSearch.send_keys(fetchEmail, Keys.RETURN) # <--- This is where I want to be able to paste current line for everytime function is called.
    return main
I would appreciate any help how I can solve this.

It's a bit tricky to diagnose your particular issue, since you don't actually provide real code. However, probably one of the following will help you:
Return the list of all lines from fetchEmail, then search for all of them in send_keys:
def fetchEmail():
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt"
    with open(fileNameExt) as f:
        return f.read().splitlines()
def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    emailSearch.send_keys(fetchEmail(), Keys.RETURN)
    # ...
Yield them one at a time, and look for them individually:
def fetchEmail():
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt"
    with open(fileNameExt) as f:
        for line in f:
            yield line.strip()
def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    for email in fetchEmail():
        emailSearch.send_keys(email, Keys.RETURN)
        # ...
I don't recommend using globals; there should be a better way to share information between functions (such as having both of these in a class instance, or having one function call the other as shown above). Still, here is an example of how you could save the value when the first function gets called and retrieve it in the second function at an arbitrary later time:
emails = []
def fetchEmail():
    global emails
    fileName = input("Filename: ")
    fileNameExt = fileName + ".txt"
    with open(fileNameExt) as f:
        emails = f.read().splitlines()
def performSearch():
    emailSearch = driver.find_element_by_id('quicksearchinput')
    emailSearch.send_keys(emails, Keys.RETURN)
    # ...
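None of the variants above covers step 4 of the question, marking each line as "Completed" once it has been handled. One possible sketch, assuming the marker is appended after a space and that handle_email stands in for the Selenium search (both the function name process_emails and that callback are hypothetical):

```python
import os
import tempfile

def process_emails(path, handle_email):
    """Call handle_email(address) for each pending line, then mark it Completed."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path) as src, tempfile.NamedTemporaryFile(
            "w", dir=directory, delete=False) as tmp:
        for line in src:
            email = line.strip()
            if not email or email.endswith(" Completed"):
                # Keep blank lines and already-processed entries unchanged
                tmp.write(line)
                continue
            handle_email(email)
            tmp.write(email + " Completed\n")
    # Swap the annotated copy in place of the original list
    os.replace(tmp.name, path)
```

The annotated copy only replaces the original after every line has been handled, so if handle_email raises, the original file is left untouched (though the temporary file remains and would need cleaning up).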

delete a user defined text from a text file in python

def Delete_con():
    contact_to_delete= input("choose name to delete from contact")
    to_Delete=list(contact_to_delete)
    with open("phonebook1.txt", "r+") as file:
        content = file.read()
        for line in content:
            if not any(line in line for line in to_Delete):
                content.write(line)
I get no errors, but the line is not deleted. This function asks the user which name he or she wants to delete from the text file.

This should help.
def Delete_con():
    contact_to_delete= input("choose name to delete from contact")
    contact_to_delete = contact_to_delete.lower() #Convert input to lower case
    with open("phonebook1.txt", "r") as file:
        content = file.readlines() #Read lines from text
    content = [line for line in content if contact_to_delete not in line.lower()] #Check if user input is in line
    with open("phonebook1.txt", "w") as file: #Write back content to text
        file.writelines(content)

Assuming that:
you want the user to supply just the name, and not the full 'name:number' pair
your phonebook stores one name:number pair per line
I'd do something like this:
import os
from tempfile import NamedTemporaryFile
def delete_contact():
    contact_name = input('Choose name to delete: ')
    # You probably want to pass path in as an argument
    path = 'phonebook1.txt'
    base_dir = os.path.dirname(path)
    with open(path) as phonebook, \
            NamedTemporaryFile(mode='w+', dir=base_dir, delete=False) as tmp:
        for line in phonebook:
            # rsplit instead of split supports names containing ':'
            # if numbers can also contain ':' you need something smarter
            name, number = line.rsplit(':', 1)
            if name != contact_name:
                tmp.write(line)
    os.replace(tmp.name, path)
Using a tempfile like this means that if something goes wrong while processing the file you aren't left with a half-written phonebook, you'll still have the original file unchanged. You're also not reading the entire file into memory with this approach.
os.replace() is Python 3.3+ only; if you're using something older, you can use os.rename() as long as you're not on Windows.
Here's the tempfile documentation. In this case, you can think of NamedTemporaryFile(mode='w+', dir=base_dir, delete=False) as something like open('tmpfile.txt', mode='w+'). NamedTemporaryFile saves you from having to find a unique name for your tempfile (so that you don't overwrite an existing file). The dir argument creates the tempfile in the same directory as phonebook1.txt which is a good idea because os.replace() can fail when operating across two different filesystems.

delete row from file using csv reader and lists python

There are similar questions on SO to this, but none that deal with the specifics that I require.
I have the following code that seeks to delete a row in a file, based on specified user input. The methodology is to
1. Read file into a list
2. Delete the relevant row in the list (ideally while reading in the list?)
3. Overwrite file.
It's steps 2 and 3 that I would like some guidance on, as well as comments on the best solution (for beginners, for teaching/learning purposes) to carry out this sort of simple delete/edit in Python with csv reader.
Code
""" ==============TASK
1. Search for any given username
2. Delete the whole row for that particular user
e.g.
Enter username: marvR
>>The record for marvR has been deleted from file.
"""
import csv
#1. This code snippet asks the user for a username and deletes the user's record from file.
updatedlist=[]
with open("fakefacebook.txt",newline="") as f:
    reader=csv.reader(f)
    username=input("Enter the username of the user you wish to remove from file:")
    for row in reader: #for every row in the file
        if username not in updatedlist:
            updatedlist=row #add each row, line by line, into a list called 'udpatedlist'
            print(updatedlist)
#delete the row for the user from the list?
#overwrite the current file with the updated list?
File contents:
username,password,email,no_of_likes
marvR,pass123,marv#gmail.com,400
smithC,open123,cart#gmail.com,200
blogsJ,2bg123,blog#gmail.com,99
Update
Based on an answer below, I have this, but when it overwrites the file, it doesn't update it with the list correctly, not sure why.
import csv
def main():
    #1. This code snippet asks the user for a username and deletes the user's record from file.
    updatedlist=[]
    with open("fakefacebook.txt",newline="") as f:
        reader=csv.reader(f)
        username=input("Enter the username of the user you wish to remove from file:")
        for row in reader: #for every row in the file
            if row[0]!=username: #as long as the username is not in the row .......
                updatedlist=row #add each row, line by line, into a list called 'udpatedlist'
                print(updatedlist)
    updatefile(updatedlist)
def updatefile(updatedlist):
    with open("fakefacebook.txt","w",newline="") as f:
        Writer=csv.writer(f)
        Writer.writerow(updatedlist)
        print("File has been updated")
main()
It appears to print the updatedfile correctly (as a list) in that it removes the username that is entered. But on writing this to the file, it only prints ONE username to the file.
Any thoughts so I can accept a final answer?

if username not in updatedlist:
To me, this should be:
if row[0] != username:
Then, in a second loop, you write updatedlist to your CSV file.
Personally, I would write everything to another file while reading, then at the end delete the old file and replace it with the new one, which needs only one loop.
Edit:
Replace updatedlist=row with updatedlist.append(row): the first overwrites updatedlist with a single row, while the second adds one more row to it.
writerow writes one row, and you are giving it a list of rows.
Use writerows instead and your writing function will work.
You nearly got there all by yourself, which was my objective.
Some other answers already give you better (faster, cleaner ...) ways, so I won't.
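Putting those two corrections together (append instead of assignment, writerows instead of writerow), a minimal sketch of the read-filter-overwrite approach could look like this. The function name remove_user and taking the path and username as parameters are choices made for illustration, not part of the original code; empty rows are dropped as a side effect of the guard on row[0].

```python
import csv

def remove_user(path, username):
    """Rewrite the CSV at path without rows whose first column equals username."""
    kept_rows = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row and row[0] != username:
                kept_rows.append(row)  # append keeps every surviving row
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(kept_rows)  # writerows takes a list of rows
    return kept_rows
```

Calling remove_user("fakefacebook.txt", "marvR") would leave the header and the other two users in the file.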

I recommend this approach:
with open("fakefacebook.txt", 'r+') as f:
    lines = f.readlines()
    f.seek(0)
    username = input("Enter the username of the user you wish to remove from file: ")
    for line in lines:
        if line.split(',')[0] != username: # e.g. if username == 'marvR', the line starting with 'marvR' will not be written
            f.write(line)
    f.truncate()
All lines from the file are read into lines. Then I go back to the beginning of the file with f.seek(0). At this point the user is asked for a username, which is then used to check each line before writing it back to the file. If the first field of a line equals the specified username, the line is not written, thus 'deleting' it. Finally, f.truncate() removes whatever is left over from the old, longer contents. I hope this helps; if you have any questions, don't hesitate to ask!

I tried to stick to your code: (EDIT: not elegant, but as near as possible to the OPs code)
""" ==============TASK
1. Search for any given username
2. Delete the whole row for that particular user
e.g.
Enter username: marvR
>>The record for marvR has been deleted from file.
"""
import csv
#1. This code snippet asks the user for a username and deletes the user's record from file.
updatedlist=[]
with open("fakefacebook.txt",newline="") as f:
    reader=csv.reader(f)
    username=input("Enter the username of the user you wish to remove from file:")
    content = []
    for row in reader: #for every row in the file
        content.append(row)
# transpose list
content = list(map(list, zip(*content)))
print(content)
index = [i for i,x in enumerate(content[0]) if x == username]
if index: # only pop when the username was actually found
    for sublist in content:
        sublist.pop(index[0])
print(content)
# transpose list
content = list(map(list, zip(*content)))
#write back as CSV rows (writing "%s" % item would dump the Python list repr instead)
with open('fakefacebook.txt', 'w', newline='') as thefile:
    csv.writer(thefile).writerows(content)
But I would suggest using numpy or pandas.

Something like this should do you, using the csv module. Since you have structured tabular data with defined columns you should use a DictReader and DictWriter to read and write to/from your file;
import csv
with open('fakefacebook.txt', 'r+', newline='') as f:
    username = input("Enter the username of the user you wish "
                     "to remove from file:")
    columns = ['username', 'password', 'email', 'no_of_likes']
    reader = csv.DictReader(f, columns)
    filtered_output = [line for line in reader if line['username'] != username]
    f.seek(0)
    writer = csv.DictWriter(f, columns)
    writer.writerows(filtered_output)
    f.truncate()
This opens the input file, filters out any entries where the username equals the one to be deleted, and writes the remaining entries back to the input file, overwriting what's already there.

And for another answer: write to a new file and then rename it!
import csv
import os
def main():
    username = input("Enter the username of the user you wish to remove from file:")
    # check it's not 'username'!
    #1. This code snippet asks the user for a username and deletes the user's record from file.
    with open("fakefacebook.txt", newline="") as f_in, \
            open("fakefacebook.txt.new", "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        for row in reader: #for every row in the file
            if row[0] != username: # as long as the username is not in the row
                writer.writerow(row)
    # rename new file (os.replace also overwrites the existing file on Windows, where os.rename would fail)
    os.replace("fakefacebook.txt.new", "fakefacebook.txt")
main()

Troublesome DAT editing with Python 2.7 and ConfObj/Parser

Edit - final open source code here if interested. https://github.com/qetennyson/ConfigFileBuilder
Hi there, first question here. I'm relatively new to Python (using 2.7 here) and have always been a pretty average programmer as it is.
I'm working on a short program that builds configuration files for these proprietary, internet connected power switches of which I have about 90. Each needs a unique configuration for the city it's going to.
I don't know a ton about filetypes, but these guys are DATs which I figured were similar enough to INI for me to bang my head against the keyboard for six to seven hours, years, eras.
Here's my existing code.
from configobj import ConfigObj
import csv
import os
config = ConfigObj('configtest3.dat', list_values=True, raise_errors=False)
with open('config.csv') as csvfile:
    csvreader = csv.reader(csvfile, dialect='excel')
    office = 'null'
    while (office != 'LASTOFFICEINLIST'):
        for row in csvreader:
            config.filename = row[0] + '-powerswitch'
            config['Hostname'] = row[0]
            config['Ipaddr'] = row[2]
            config['OutletName1'] = row[3]
            config['Gateway'] = row[4]
            config['DNS'] = 'DNSTEST'
            config['Account1'] = row[6]
            config['Password1'] = 'passwordtest'
            config['TimeServer1'] = 'x.x.x.x' # <--Sanitized
            config['TimeZone'] = '800'
            office = row[0]
            config.write()
You're thinking "that should work fine" and it does, with one exception!
If I don't remove the "Default" from the beginning of the DAT file (shown here):
#The following line must not be removed.
Default
SSID1=MegaBoot_112D36
WebInit=1
...then ConfigObj will refuse to read it (I'm assuming it's a key/value issue).
I haven't actually tested one of these devices without the "Default" there in the configuration, but I'm inclined to listen to the line telling me not to remove it, and I don't want to brick a device really.
In my beginner-ness, I did some more research and realized that I could remove "Default" programmatically and then add it back after I've done my ConfigObj work, but I wanted to check in here first. Plus, I was able to get "Default" out pretty easily:
with open ('configtest3.dat', 'r+') as f:
    config = f.readlines()
    f.seek(0)
    for i in config:
        if i != 'Default\n':
            f.write(i)
    f.truncate()
...but am unsure of how to shoehorn it back in there.
I could be shoehorning this entire thing!
Please enlighten me.
Problem solved! Thank you for your feedback Sweater-Baron.
Here's my final code, which could probably do with some refactoring, but I'll get to that. I finally have a fully functioning config generator!
from configobj import ConfigObj
import csv
import os
config = ConfigObj('configtest.dat', list_values=True, raise_errors=False)
with open('config.csv') as csvfile:
    configcsv = csv.reader(csvfile, dialect='excel')
    office = 'null'
    while (office != 'Wooster'):
        for row in configcsv:
            config.filename = row[0] + '-powerswitch.dat'
            config['Hostname'] = row[0]
            config['Ipaddr'] = row[2]
            config['OutletName1'] = row[3]
            config['Gateway'] = row[4]
            config['DNS'] = 'DNSTEST'
            config['Account1'] = row[6]
            config['Password1'] = 'passwordtest'
            config['TimeServer1'] = '10.116.65.17'
            config['TimeZone'] = '800'
            office = row[0]
            config.write()
with open('config.csv') as csvfile:
    configcsv = csv.reader(csvfile, dialect='excel')
    csvfile.seek(0)
    office = 'null'
    while (office != 'Wooster'):
        for row in configcsv:
            filename = row[0] + '-powerswitch.dat'
            with open(filename, 'r') as cfgFile:
                almostCorrect = cfgFile.read()
            correct = "#The following line must not be removed.\nDefault\n" + almostCorrect
            with open(filename, 'w') as cfgFile:
                cfgFile.write(correct)
            print row[0]
            office = row[0]
I realized one thing I was getting hung up on was the initial deletion of "Default" at all. Instead, I just deleted that from the base file from which I'm configuring the 90 files. I'm not sure why I thought I needed it in there in the first place, if I just want to add it back in the end!
Thanks again.

This is sort of an intellectually lazy solution, but it seems like it would require minimal changes to your existing code:
Use your existing code to create a config file without the "Default" line. Then read that file back into Python as a string. Then erase everything from the file, add the "Default" line to it, and then write back all the other contents of the file that you just read out of it.
The following code will add "Default" to the beginning of a text file (also not great code, just demonstrating what I mean):
with open("ConfigFilePath", "r") as cfgFile:
    almostCorrect = cfgFile.read()
correct = "Default\n" + almostCorrect
# Overwrite the original file with our new string which has "Default" at the beginning:
with open("ConfigFilePath", "w") as cfgFile:
    cfgFile.write(correct)

How do I add to a list in a file while also being able to be written to?

I'm trying to do something along these lines: I need a way to set up a file and be able to add a value to it each time the command is used.
The file format will be this:
'{} | {}'.format(user.id, list_items)
I don't want list_items to be overwritten. I want it kept as a list that can be added to.
Full Code:
with open('test2.txt', 'a+') as f:
    newf = new_list_file.png
    user.id = message.id
    list_files = file.name
    f.readlines()
    if user.id in f:
        f.write(newf)
    else:
        f.write('{} | [{}]'.format(user.id, newf))
Each time a new person uses this command, it should add them to my registry along with the file they uploaded. I need the list updated without rewriting it, which is why the if statement is there.
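One way this could be approached, sketched under assumptions: a line in a text file cannot grow in place, so the sketch reads the whole registry, updates the user's list in memory, and writes the file back. It assumes one "user_id | item1, item2" line per user (without the square brackets used above), and the function name add_upload is hypothetical.

```python
import os

def add_upload(path, user_id, filename):
    """Append filename to user_id's entry in a 'user_id | a, b' registry file."""
    registry = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                if " | " not in line:
                    continue  # skip blank or malformed lines
                key, _, items = line.rstrip("\n").partition(" | ")
                registry[key] = [item for item in items.split(", ") if item]
    # setdefault creates an empty list for users seen for the first time
    registry.setdefault(str(user_id), []).append(filename)
    with open(path, "w") as f:
        for key, items in registry.items():
            f.write("{} | {}\n".format(key, ", ".join(items)))
    return registry[str(user_id)]
```

Existing entries are preserved on every call because the file is rebuilt from the in-memory registry rather than appended to blindly. For a large registry, a format such as JSON or a small database would likely be easier to maintain.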
