Extract data from text file using Python (or any language)


I have a text file that looks like:
First Name Bob
Last name Smith
Phone 555-555-5555
Email bob#bob.com
Date of Birth 11/02/1986
Preferred Method of Contact Text Message
Desired Appointment Date 04/29
Desired Appointment Time 10am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
First Name john
Last name Smith
Phone 555-555-4444
Email john#gmail.com
Date of Birth 03/02/1955
Preferred Method of Contact Text Message
Desired Appointment Date 05/22
Desired Appointment Time 9am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
.... and so on
I need to extract each entry to a csv file, so the data should look like: first name, last name, phone, email, etc. I don't even know where to start on something like this.

First of all, you'll need to open the text file in read mode.
I'd suggest using a context manager, like so:
with open('path/to/your/file.txt', 'r') as file:
    for line in file.readlines():
        # do something with the line (it is a string)
        pass
As for managing the data, you could build an intermediate structure, for example a dictionary or a list of dictionaries, and then write that out as a CSV file with the csv module.
You could, for example, start a new record whenever there is a blank line, maybe like this:
with open('Downloads/test.txt', 'r') as f:
    my_list = list()  # this will be the final list
    entry = dict()  # this contains each user info as a dict
    for line in f.readlines():
        if line.strip() == "":  # if line is empty start a new dict
            my_list.append(entry)  # and append the old one to the list
            entry = dict()
        else:  # otherwise split the line and create new dict
            line_items = line.split(r' ')
            print(line_items)
            entry[line_items[0]] = line_items[1]
print(my_list)
This code won't work as-is because your text is not consistently formatted: splitting on the first space turns "First Name Bob" into the key "First" and the value "Name". You need a consistent way to split each line into "title" and "content" (such as "First Name" and "Bob"). I suggest looking at regex, or fixing the text file so that the spacing is consistent.
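One way to get that consistent split without regex is to match each line against a list of known labels. The following is only a sketch under some assumptions: the labels in FIELDS are copied from the sample data, every record is assumed to begin with a "First Name" line, and the helper names parse_records and to_csv are made up for illustration.

```python
import csv
import io

# Labels copied from the sample data in the question.
FIELDS = [
    "First Name", "Last name", "Phone", "Email", "Date of Birth",
    "Preferred Method of Contact", "Desired Appointment Date",
    "Desired Appointment Time", "City", "Location", "IP Address",
    "User-Agent (Browser/OS)", "Referrer",
]


def parse_records(lines):
    """Return a list of dicts, one per record, from an iterable of lines."""
    # Try longer labels first so a short label can never shadow a longer one.
    by_length = sorted(FIELDS, key=len, reverse=True)
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no data
        for label in by_length:
            if line.startswith(label):
                # Assumption: a "First Name" line starts a new record.
                if label == FIELDS[0]:
                    records.append({})
                if records:
                    records[-1][label] = line[len(label):].strip()
                break
    return records


def to_csv(records):
    """Render the records as CSV text, leaving missing fields empty."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


sample = """First Name Bob
Last name Smith
Phone 555-555-5555
City Pittsburgh
First Name john
Last name Smith
Phone 555-555-4444
"""

records = parse_records(sample.splitlines())
print(to_csv(records))
```

To use it on a real file, pass the open file object to parse_records and write the result of to_csv to a file opened with newline=''.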

assuming the data resides in a:
a="""
First Name Bob
Last name Smith
Phone 555-555-5555
Email bob#bob.com
Date of Birth 11/02/1986
Preferred Method of Contact Text Message
Desired Appointment Date 04/29
Desired Appointment Time 10am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
First Name john
Last name Smith
Phone 555-555-4444
Email john#gmail.com
Date of Birth 03/02/1955
Preferred Method of Contact Text Message
Desired Appointment Date 05/22
Desired Appointment Time 9am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
"""
line_sep = "\n"  # CHANGE ME ACCORDING TO DATA
fields = ["First Name", "Last name", "Phone",
          "Email", "Date of Birth", "Preferred Method of Contact",
          "Desired Appointment Date", "Desired Appointment Time",
          "City", "Location", "IP Address", "User-Agent", "Referrer"]
records = a.split(line_sep * 2)
all_records = []
for record in records:
    splitted_record = record.split(line_sep)
    one_record = {}
    csv_record = []
    for f in fields:
        found = False
        for one_field in splitted_record:
            if one_field.startswith(f):
                data = one_field[len(f):].strip()
                one_record[f] = data
                csv_record.append(data)
                found = True
        if not found:
            csv_record.append("")
    all_records.append(";".join(csv_record))
one_record holds the record as a dictionary, and csv_record holds it as a list of fields (ordered as in the fields variable).

Edited to add: ignore this answer, the code from Koko Jumbo looks infinitely more sensible and actually gives you a CSV file at the end of it! It was a fun exercise though :)
Just to expand on fcagnola's code a bit.
If it's a quick and dirty one-off, and you know that the data will be consistently presented, the following should work to create a list of dictionaries with the correct key/value pairing. Each line is processed by splitting the line and comparing the line number (reset to 0 with each new dict) against an array of values that represent where the boundary between key and value falls.
For example, "First Name Bob" becomes ["First","Name","Bob"]. The function has been told that linenumber= 0 so it checks entries[linenumber] to get the value "2", which it uses to join the key name (items 0 & 1) and then join the data (items 2 onwards). The end result is ["First Name", "Bob"] which is then added to the dictionary.
class Extract:
    def extractEntry(self, linedata, lineindex):
        # Hardcoded list! The quick and dirty part.
        # This is specific to the example data provided. The entries
        # represent the index to be used when splitting the string
        # between the key and the data
        entries = (2, 2, 1, 1, 3, 4, 3, 3, 1, 1, 2, 2, 1)
        return self.createNewEntry(linedata, entries[lineindex])
    def createNewEntry(self, linedata, dataindex):
        list_data = linedata.split()
        key = " ".join(list_data[:dataindex])
        data = " ".join(list_data[dataindex:])
        return [key, data]
with open('test.txt', 'r') as f:
    my_list = list()  # this will be the final list
    entry = dict()  # this contains each user info as a dict
    extr = Extract()  # class for splitting the entries into key/value
    x = 0
    for line in f.readlines():
        if line.strip() == "":  # if line is empty start a new dict
            my_list.append(entry)  # and append the old one to the list
            entry = dict()
            x = 0
        else:  # otherwise split the line and create new dict
            extracted_data = extr.extractEntry(line, x)
            entry[extracted_data[0]] = extracted_data[1]
            x += 1
    my_list.append(entry)
print(my_list)

Related

How to do search by option to search from files? [closed]

I'm a beginner trying to build a simple library management system using Python. Users can search a book from a list of many books stored in a text file. Here is an example of what is in the text file:
Author: J.K Rowling
Title: Harry Potter and the Deathly Hollow
Keywords: xxxx
Published by: xxxx
Published year: xxxx
Author: Stephen King
Title: xxxx
Keywords: xxxx
Published by: xxxx
Published year: xxxx
Author: J.K Rowling
Title: Harry Potter and the Half Blood Prince
Keywords: xxxx
Published by: xxxx
Published year: xxxx
This is where it gets difficult for me. There is a Search by Author option for the user to search books. What I want to do is when the users search for any authors (e.g. J.K Rowling), it would output all (in this case, there are two J.K Rowling books) of the related components (Author, Title, Keywords, Published by, Published year). This is the last piece of the program, which I'm having very much difficulty in doing. Please help me, and thank you all in advance.

Is it possible for you to implement the text file in the form of a JSON file instead? It could be a better alternative since you could easily access all the values depending on the key you have chosen and search through those as well.
{
    "Harry Potter and the Deathly Hollow": {
        "Author": "J.K Rowling",
        "Keywords": "xxxx",
        "Published by": "xxxx",
        "Published year": "xxxx"
    },
    "Example 2": {
        "Author": "Stephen King",
        "Keywords": "xxxx",
        "Published by": "xxxx",
        "Published year": "xxxx"
    }
}
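If you do go the JSON route, the search becomes a lookup over the loaded dictionary. Here is a rough sketch, assuming the layout above (titles as keys, each with an "Author" entry); the "xxxx" values are the same placeholders as in the question, and titles_by_author is just an illustrative name.

```python
import json

# Placeholder data in the layout suggested above; "xxxx" stands in for real values.
library_json = """
{
    "Harry Potter and the Deathly Hollow": {"Author": "J.K Rowling", "Keywords": "xxxx"},
    "Example 2": {"Author": "Stephen King", "Keywords": "xxxx"},
    "Harry Potter and the Half Blood Prince": {"Author": "J.K Rowling", "Keywords": "xxxx"}
}
"""

library = json.loads(library_json)


def titles_by_author(library, author):
    """Return the titles whose "Author" entry matches, ignoring case and spaces."""
    wanted = author.strip().lower()
    return [
        title
        for title, info in library.items()
        if info.get("Author", "").strip().lower() == wanted
    ]


for title in titles_by_author(library, "j.k rowling"):
    # Print the title followed by every stored detail for that book.
    print(title)
    for key, value in library[title].items():
        print(f"  {key}: {value}")
```

With a real file you would replace json.loads(library_json) with json.load on the opened file.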

You can iterate through the lines of the text file like this:
with open(r"path\to\text_file.txt", "r") as books:
    lines = books.readlines()
for index in range(len(lines)):
    line = lines[index]
Now, get the author of each book by splitting the line on the ":" character and testing whether the first part == "Author". Then take the second part of the split string and strip it of the "\n" (newline) and " " characters, so that no stray whitespace on either side messes up the search. I would also recommend lowercasing both the author name and the search query so that capitalisation does not matter. Test whether the result is equal to the search query:
if line.split(":")[0] == "Author" and \
        line.split(":")[1].strip("\n ").lower() == search_query.lower():
Then, inside this if block, print out all the required information about this book.
Completed code:
search_query = "J.K Rowling"
with open(r"books.txt", "r") as books:
    lines = books.readlines()
for index in range(len(lines)):
    line = lines[index]
    if line.split(":")[0] == "Author" and line.split(":")[1].strip("\n ").lower() == search_query.lower():
        print(*lines[index + 1: index + 5])

Generally, a lot of problems to be programmed can be resolved into a three-step process:
Read the input into an internal data structure
Do processing as required
Write the output
This problem seems like quite a good fit for that pattern:
In the first part, read the text file into an in-memory list of either dictionaries or objects (depending on what's expected by your course)
In the second part, search the in-memory list according to the search criteria; this will result in a shorter list containing the results
In the third part, print out the results neatly
It would be reasonable to put these into three separate functions, and to attack each of them separately
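The three steps could be sketched roughly as below. This is one possible shape rather than the expected course solution: it assumes every book starts with an "Author" line, and the function names are invented for the example.

```python
def read_books(lines):
    """Step 1: turn 'Key: value' lines into a list of dicts, one per book."""
    books = []
    for raw in lines:
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or no colon: nothing to store
        key, value = key.strip(), value.strip()
        if key == "Author":
            books.append({})  # assumption: an Author line starts each book
        if books:
            books[-1][key] = value
    return books


def search_by_author(books, author):
    """Step 2: keep only the books whose author matches, ignoring case."""
    wanted = author.strip().lower()
    return [book for book in books if book.get("Author", "").lower() == wanted]


def format_book(book):
    """Step 3: render one book as 'Key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in book.items())


sample = """Author: J.K Rowling
Title: Harry Potter and the Deathly Hollow
Keywords: xxxx
Author: Stephen King
Title: xxxx
Author: J.K Rowling
Title: Harry Potter and the Half Blood Prince
"""

books = read_books(sample.splitlines())
matches = search_by_author(books, "j.k rowling")
for book in matches:
    print(format_book(book))
    print()
```

Keeping the steps separate means the search can be swapped for a title or keyword search later without touching the file reading.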

# To read the details from the file ex books.txt
with open("books.txt", "r") as fd:
    lines = fd.read()
#Split the lines based on Author. As Author word will be missing after split so add the Author to the result. The entire result is in bookdetails list.
bookdetails = ["Author" + line for line in lines.split("Author")[1:]]
#Author Name to search
authorName = "J.K Rowling"
# Search for the given author name from the bookdetails list. Split the result based on new line results in array of details.
result = [book.splitlines() for book in bookdetails if "Author: " + authorName in book]
print(result)

If you will always receive this format of the file and you want to transform it into a dictionary:
def read_author(file):
    data = dict()
    with open(file, "r") as f:
        li = f.read().split("\n")
    for e in li:
        if ":" in e:
            data[e.split(":")[0]] = e.split(":")[1]
    return data['Author']
Note: The text file sometimes has empty lines so I check if the line contains the colon (:) before transforming it into a dict.
Then if you want a more generic method you can pass the KEY of the element you want:
def read_info(file, key):
    data = dict()
    with open(file, "r") as f:
        li = f.read().split("\n")
    for e in li:
        if ":" in e:
            data[e.split(":")[0]] = e.split(":")[1]
    return data[key]
Separating the file reading into its own method, as follows, makes the code more modular:
class BookInfo:
    def __init__(self, file) -> None:
        self.file = file
        self.data = None
    def __read_file(self):
        # Read and parse the file only once, on first use
        if self.data is None:
            with open(self.file, "r") as f:
                li = f.read().split("\n")
            self.data = dict()
            for e in li:
                if ":" in e:
                    self.data[e.split(":")[0]] = e.split(":")[1]
    def read_author(self):
        self.__read_file()
        return self.data['Author']
Then create objects for each book:
info = BookInfo("book.txt")
print(info.read_author())

How to pull specific parts of a list on each line?

I have a list that spits out information like this: ['username', 'password'], ['username', 'password'], ['username', 'password'], and so on..
I would like to be able to pull a specific username and password later on.
For example:
['abc', '9876'], ['xyz', '1234']
pull abc and tell them the password is 9876.
Then pull xyz and tell them the password is 1234
I tried messing around with the list and I am just drawing a blank on how to do this.
lines = []
with open("output.txt", "r") as f:
    for line in f.readlines():
        if 'Success' in line:
            # get rid of everything after word success so only username and password is printed out
            lines.append(line[:line.find("Success")-1])
for element in lines:
    # split username and password up at : so they are separate entities
    # original output was username:password, want it to be username, password
    parts = element.strip().split(":")
    print(parts)
I want to pull each username and then pull their password, as described above.
The current output after running this is ['username', 'password']. The original output file had extra information that I removed, which is what the code involving 'Success' takes care of.
I would like to do this without hardcoding a username. I am trying to automate the process so that it runs through every username and prints "hi [username], your password is [123]" for each of them.
Later, I would like to be able to tell only a specific user their password. For example, if I send an email to user abc, that email should contain only the username and password of user abc.

Instead of printing parts, append them to a list.
data = []
for element in lines:
    parts = element.strip().split(":")
    data.append(parts)
Then you could convert these into a dictionary for lookup
username_passwords = dict(data)
print(username_passwords['abc'])

If I am understanding this correctly, parts is the list that contains [username, password]. If so, and since parts should only ever have two elements, we can store each one as a key/value pair in a dictionary and look up the password by username later on.
lines = []
User_Pass = {}
with open("output.txt", "r") as f:
    for line in f.readlines():
        if 'Success' in line:
            # get rid of everything after word success so only username and password is printed out
            lines.append(line[:line.find("Success")-1])
for element in lines:
    # split username and password up at : so they are separate entities
    parts = element.strip().split(":")
    User_Pass.update({parts[0]: parts[1]})
Then you can call the password from the username as follows if you know the username:
x = User_Pass["foo"]
Or as you stated in the comments:
for key, value in User_Pass.items():
    print('Username ' + key + ' Has a Password of ' + value)

it looks like after you do this
lines.append(line[:line.find("Success")-1])
lines = ['username:password', 'username:password'...]
so I would do this
new_list_of_lists = [element.strip().split(":") for element in lines]
new_list_of_lists should now look like [[username, password], [username, password]]
then just do this:
dict_of_usernames_and_passwords = dict(new_list_of_lists)
With a dict you can now retrieve passwords by username, like:
dict_of_usernames_and_passwords['abc']
You can save the dict to a file using the json module, for easy retrieval.
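Saving and reloading the dict with the json module might look roughly like this. It is a sketch only: the two entries are the sample values from the question, and the file name passwords.json (written to a temporary directory here) is made up.

```python
import json
import os
import tempfile

# Sample "username:password" strings, as produced by the question's first loop.
lines = ["abc:9876", "xyz:1234"]

# Split each entry once on ":" and build the lookup dict.
passwords = dict(entry.strip().split(":", 1) for entry in lines)

# Hypothetical file location; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "passwords.json")

with open(path, "w") as f:
    json.dump(passwords, f)

with open(path) as f:
    loaded = json.load(f)

# Message for every user, without hardcoding any username.
for username, password in loaded.items():
    print(f"hi {username}, your password is {password}")

# Lookup for one specific user.
print(loaded["abc"])
```

Splitting with a limit of 1 keeps a password intact even if it happens to contain a ":" itself.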

Converting a text file into csv file using python

I have a requirement wherein I need to convert my text files into CSV, and I am using Python to do it. My text file looks like this:
Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football
I want my CSV file to have the column names as Employee Name, Employee Number , Age and Hobbies and when a particular value is not present it should have a value of NA in that particular place. Any simple solutions to do this? Thanks in advance

You can do something like this:
records = """Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football"""
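for record in records.split('Employee Name'):
    if not record.strip():
        continue  # skip the empty chunk before the first 'Employee Name'
    name = 'NA'
    number = 'NA'
    age = 'NA'
    hobbies = 'NA'
    for field in record.split('\n'):
        if ':' not in field:
            continue  # skip empty lines
        # strip both parts, since the spacing around ':' is inconsistent
        field_name, field_value = [part.strip() for part in field.split(':', 1)]
        if field_name == "":  # This is employee name, since we split on it
            name = field_value
        if field_name == "Employee Number":
            number = field_value
        if field_name == "Age":
            age = field_value
        if field_name == "Hobbies":
            hobbies = field_value
    print(name, number, age, hobbies)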
Of course, this method assumes that there is (at least) Employee Name field in every record.
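To go all the way to a CSV with NA for missing values, csv.DictWriter and its restval argument can do the filling. The sketch below makes the same assumption, that each record starts with an "Employee Name" line; parse_employees is an illustrative name, and the output is written to an in-memory buffer instead of a file.

```python
import csv
import io

COLUMNS = ["Employee Name", "Employee Number", "Age", "Hobbies"]


def parse_employees(text):
    """Return one dict per employee; absent fields are simply left out."""
    records = []
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # no colon on this line
        key, value = key.strip(), value.strip()
        if key == "Employee Name":
            records.append({})  # assumption: this line starts a new record
        if records and key in COLUMNS:
            records[-1][key] = value
    return records


text = """Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football"""

out = io.StringIO()
# restval is what DictWriter writes for any column missing from a row.
writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="NA", lineterminator="\n")
writer.writeheader()
writer.writerows(parse_employees(text))
print(out.getvalue())
```

For a real file, replace the StringIO buffer with open('output.csv', 'w', newline='').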

Maybe this helps you get started? It's just the static output of the first employee data. You would now need to wrap this into some sort of iteration over the file. There is very very likely a more elegant solution, but this is how you would do it without a single import statement ;)
with open('test.txt', 'r') as f:
    content = f.readlines()
output_line = "".join([line.split(':')[1].replace('\n', ';').strip() for line in content[0:4]])
print(output_line)

I followed very simple steps for this; it may not be optimal, but it solves the problem. An important case I can see here is that the same key ("Employee Name", etc.) can appear multiple times in a single file.
Steps:
Read the txt file into a list of lines.
Convert the list to a dict (the logic can be improved, or more complex lambdas can be added here).
Simply use pandas to convert the dict to CSV.
Below is the code:
import pandas
txt_file = r"test.txt"
with open(txt_file, "r") as txt:
    txt_lines = txt.read().split("\n")
txt_dict = {}
for txt_line in txt_lines:
    if ":" not in txt_line:
        continue  # skip blank lines
    k, v = txt_line.split(":", 1)
    k = k.strip()
    v = v.strip()
    # collect every value seen for this key
    if k in txt_dict:
        values = txt_dict.get(k)
    else:
        values = []
    values.append(v)
    txt_dict[k] = values
print(pandas.DataFrame.from_dict(txt_dict, orient="index"))
Output:
0 1
Employee Number 12345 123456
Age 45 None
Employee Name XXXXX xxx
Hobbies Tennis Football
I hope this helps.

How do I merge two csv files?

I have two csv files. EMPLOYEES contains a dict of every employee at a company with 10 rows of information about each one. SOCIAL contains a dict of employees who filled out a survey, with 8 rows of information. Every employee in survey is also on the master dict. Both dicts have a unique identifier (the EXTENSION.)
I want to say "If an employee is on the SOCIAL dict, add rows 4,5,6 to their column in the EMPLOYEES dict" In other words, if an employee filled out a survey, additional information should be appended to the master dict.
Currently, my program pulls out all information from EMPLOYEES for employees who have taken the SURVEY. But I don't know how to add the additional rows of information to the EMPLOYEES csv. I have spent much of the day reading StackOverflow about DictReader and Dictionary and am still confused.
Thank you in advance for your guidance.
Sample EMPLOYEE:
Name   Extension  Job
Bill   1111       plumber
Alice  2222       fisherman
Carl   3333       rodeo clown
Sample SURVEY:
Extension  Favorite Color  Book
2222       blue            A Secret Garden
3333       green           To Kill a Mockingbird
Sample OUTPUT:
Name   Extension  Job          Favorite Color  Favorite Book
Bill   1111       plumber
Alice  2222       fisherman    blue            A Secret Garden
Carl   3333       rodeo clown  green           To Kill a Mockingbird
import csv
with open('employees.csv', "rU") as npr_employees:
    employees = csv.DictReader(npr_employees)
    all_employees = {}
    total_employees = {}
    for employee in employees:
        all_employees[employee['Extension']] = employee
with open('social.csv', "rU") as social_employees:
    social_employee = csv.DictReader(social_employees)
    for row in social_employee:
        print all_employees.get(row['Extension'], None)

You can merge two dictionaries in Python 2 using:
dict(d1.items() + d2.items())
In Python 3, items() returns a view that does not support +, so use {**d1, **d2} instead.
Using a dict, all_employees, with the key as 'Extension' works perfectly to link a "social employee" row with its corresponding "employee" row.
Then you need to go through all the updated employee info and output their fields in a consistent order. Since dictionaries are inherently orderless, we keep a list of the headers, output_headers as we see them.
import csv
# Store all the info about the employees
all_employees = {}
output_headers = []
# First, get all employee record info
with open('employees.csv', 'rU') as npr_employees:
    employees = csv.DictReader(npr_employees)
    for employee in employees:
        ext = employee['Extension']
        all_employees[ext] = employee
    # Add headers from "all employees"
    output_headers.extend(employees.fieldnames)
# Then, get all info from social, and update employee info
with open('social.csv', 'rU') as social_employees:
    social_employees = csv.DictReader(social_employees)
    for social_employee in social_employees:
        ext = social_employee['Extension']
        # Combine the two dictionaries.
        all_employees[ext] = dict(
            all_employees[ext].items() + social_employee.items()
        )
    # Add headers from "social employees", but don't add duplicate fields
    output_headers.extend(
        [field for field in social_employees.fieldnames
         if field not in output_headers]
    )
# Finally, output the records ordered by extension
with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(output_headers)
    # Write the new employee rows. If a field doesn't exist,
    # write an empty string.
    for employee in sorted(all_employees.values()):
        writer.writerow(
            [employee.get(field, '') for field in output_headers]
        )
outputs:
Name,Extension,Job,Favorite Color,Book
Bill,1111,plumber,,
Alice,2222,fisherman,blue,A Secret Garden
Carl,3333,rodeo clown,green,To Kill a Mockingbird
Let me know if you have any questions!
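For anyone on Python 3, the same idea can be sketched as follows. This is an adaptation, not the code above: it uses dict.update instead of adding items() lists, csv.DictWriter with restval for the missing survey fields, and in-memory strings built from the question's sample data instead of employees.csv and social.csv. The function name merge is made up.

```python
import csv
import io

# Sample data from the question, as CSV text.
employees_csv = """Name,Extension,Job
Bill,1111,plumber
Alice,2222,fisherman
Carl,3333,rodeo clown
"""
social_csv = """Extension,Favorite Color,Book
2222,blue,A Secret Garden
3333,green,To Kill a Mockingbird
"""


def merge(employees_file, social_file):
    """Return (headers, rows) with survey fields added to matching employees."""
    employees = csv.DictReader(employees_file)
    merged = {row["Extension"]: dict(row) for row in employees}
    headers = list(employees.fieldnames)

    social = csv.DictReader(social_file)
    for row in social:
        # The question states every surveyed employee is in the master file;
        # setdefault just avoids a KeyError if that ever fails to hold.
        merged.setdefault(row["Extension"], {}).update(row)
    headers += [name for name in social.fieldnames if name not in headers]

    # Order the rows by extension.
    return headers, [merged[ext] for ext in sorted(merged)]


headers, rows = merge(io.StringIO(employees_csv), io.StringIO(social_csv))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=headers, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With real files, open them with newline='' and pass the file objects to merge in place of the StringIO buffers.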

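You could try:
for row in social_employee:
    employee = all_employees.get(row['Extension'], None)
    if employee is not None:
        # 'additionalinfo1' and 'additionalinfo2' stand for the survey columns to copy;
        # employee is the same dict stored in all_employees, so it is updated in place
        employee['additionalinfo1'] = row['additionalinfo1']
        employee['additionalinfo2'] = row['additionalinfo2']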

Python script to create multiple users in a CSV file and generate email addresses

I want to create a CSV file that has multiple users and, at the same time, create email addresses for these users from their last names. I am using Python for this, but I can't get it to create the email addresses in the list. My script is below; what am I missing?
import csv
First_Name = ["Test"]
Last_Name = ["User%d" % i for i in range (1,10)]
Email_Address = 'Last_Name' [("#myemail.com")]
Password = ["Password1"]
# open a file for writing.
csv_out = open('mycsv.csv', 'wb')
# create the csv writer object.
mywriter = csv.writer(csv_out)
# all rows at once.
rows =zip(Email_Address, Password, First_Name, Last_Name,)
mywriter.writerows(rows)
csv_out.close()

Make
Email_Address = 'Last_Name' [("#myemail.com")]
into
Email_Address = [x + "#myemail.com" for x in Last_Name]
to create a list of all email addresses based on all last names. This assumes you intended for all of your variables to be lists.
Even though this will create ten emails (one for each last name) your file will only have one row written to it. This is because zip will stop iteration at the length of the shortest list you pass it. Currently First_Name and Password each contain only one item.
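The truncation is easy to see in isolation. In this small sketch the lists mirror the question's First_Name and Last_Name, and repeating the single first name is just one possible way to make the lengths match.

```python
last_names = ["User%d" % i for i in range(1, 10)]
first_names = ["Test"]

# zip stops as soon as the shortest input runs out: only one pair comes back.
short = list(zip(first_names, last_names))

# Repeating the one-item list to the same length gives one pair per last name.
full = list(zip(first_names * len(last_names), last_names))

print(len(short), len(full))
print(full[-1])
```

The same repetition would apply to Password before passing everything to zip.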

I'm basically guessing since you haven't said anything about what errors you're getting, but the most obvious problem I can see is that you're trying to add a string to a list of tuples, which doesn't make a lot of sense.
'Last_Name' [("#myemail.com")]
should be:
'Last_Name' + "#myemail.com"
Now, as far as what you're actually trying to do, which is extremely unclear, I think you want to use a series of list comprehensions. For example:
import csv
users = list(range(0, 10))
first_names = ["test" + str(user) for user in users]
last_names = ["User%d" % user for user in users]
email_addresses = [last_name + "#myemail.com" for last_name in last_names]
passwords = ["Password1" for user in users]
# In Python 3 the csv module needs a text-mode file opened with newline=''
with open('mycsv.csv', 'w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerows(zip(email_addresses, passwords, first_names, last_names))
output:
User0#myemail.com,Password1,test0,User0
User1#myemail.com,Password1,test1,User1
User2#myemail.com,Password1,test2,User2
User3#myemail.com,Password1,test3,User3
User4#myemail.com,Password1,test4,User4
User5#myemail.com,Password1,test5,User5
User6#myemail.com,Password1,test6,User6
User7#myemail.com,Password1,test7,User7
User8#myemail.com,Password1,test8,User8
User9#myemail.com,Password1,test9,User9

Your zip() will only produce a list with 1 item because First_Name and Password each explicitly contain only 1 item.
How about this, avoiding the zip entirely:
import csv
with open('mycsv.csv', 'w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    # range(1, 10) covers User1 through User9, as in the question
    for i in range(1, 10):
        writer.writerow(["User%d#myemail.com" % i, "Password%d" % i, "test%d" % i, "User%d" % i])
