So I have a large text file. It contains a bunch of information in the following format:
|NAME|NUMBER(1)|AST|TYPE(0)|TYPE|NUMBER(2)||NUMBER(3)|NUMBER(4)|DESCRIPTION|
Sorry for the vagueness. All the information is formatted like the above, and between each descriptor is the separator '|'. I want to be able to search the file for the 'NAME' and then print each descriptor in its own tag, such as in this example:
Name
Number(1):
AST:
TYPE(0):
etc....
In case I'm still being confusing: I want to be able to search for the name and then print out the information that follows, each piece being separated by a '|'.
Can anyone help?
EDIT
Here is an example of a part of the text file:
|Trevor Jones|70|AST|White|Earth|3||500|1500|Old Man Living in a retirement home|
This is the code I have so far:
with open('LARGE.TXT') as fd:
    name = 'Trevor Jones'
    input = [x.split('|') for x in fd.readlines()]
    to_search = {x[0]: x for x in input}
    print('\n'.join(to_search[name]))
First you need to break the file up somehow. I think that a dictionary is the best option here. Then you can get what you need.
d = {}
# Where `fl` is our file object
for L in fl:
    # Drop the trailing newline, then skip the first pipe
    detached = L.rstrip('\n')[1:].split('|')
    # May wish to process here
    d[detached[0]] = detached[1:]

# Can do whatever with this information now
print(d.get('string_to_search'))
Something like
# Opens the file in a 'safe' manner
with open('large_text_file') as fd:
    # This reads in the file and splits it into tokens;
    # the strips remove the newline and the extra pipes
    input = [x.strip().strip('|').split('|') for x in fd.readlines()]
    # This makes it into a searchable dictionary
    to_search = {x[0]: x for x in input}
and then search with
to_search[NAME]
Depending on the format you want the answers in use
print(' '.join(to_search[NAME]))
or
print('\n'.join(to_search[NAME]))
A word of warning: this solution assumes the names are unique; if they aren't, a more complex solution may be required.
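If names can repeat, one sketch of a more complex solution is to collect every record for a name in a list instead of overwriting. The sample lines reuse the question's pipe format; the second line's description is made up for illustration:

```python
# Collect every record under its name instead of overwriting duplicates.
lines = [
    '|Trevor Jones|70|AST|White|Earth|3||500|1500|Old Man Living in a retirement home|',
    '|Trevor Jones|71|AST|White|Earth|3||500|1500|A second record with the same name|',
]

records = {}
for line in lines:
    fields = line.strip().strip('|').split('|')
    # setdefault creates an empty list the first time a name appears
    records.setdefault(fields[0], []).append(fields[1:])

# Print every record stored for one name
for fields in records['Trevor Jones']:
    print(' | '.join(fields))
```

Searching with `records.get(name, [])` then returns all matching records rather than just the last one read.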
I have started to make an anagram solver with Python, but since I was using a lot of words (every word in the English dictionary), I didn't want them as an array in my Python file and instead had them in a separate text file called dictionaryArray2.txt.
I can easily import my text file and display my words on the screen using Python but I cannot find a way to select a specific element from the array rather than displaying them all.
When I do print(dictionary[2]) it prints the second letter of the whole file rather than the second element of the array. I never get any errors; it just doesn't work.
I have tried multiple things but they all have the same output.
My code below:
f = open("dictionaryArray2.txt", "r")
dictionary = f.read()
f.close()
print(dictionary[2])
If you want to split the content of dictionaryArray2 into separate words, do:
f = open("dictionaryArray2.txt", "r")
words = f.read().split()
f.close()
print(words[2])
If you want to split the content of dictionaryArray2 into separate lines, do:
f = open("dictionaryArray2.txt", "r")
lines = f.read().splitlines()
f.close()
print(lines[2])
I think the problem is that you're reading the entire file into a single long string. If your input dictionary is one word per line, what you want is to turn a text file like this:
apple
bat
To something like this:
dictionary = ['apple', 'bat']
There's an existing answer that might offer some useful code examples, but in brief: f.read() reads the entire file object f into one string, while f.readlines() returns a list with one entry per line.
To quote from the official Python docs:
If you want to read all the lines of a file in a list you can also use list(f) or f.readlines().
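To see the difference without touching the disk, here's a quick demonstration with an in-memory file (the sample words are made up):

```python
import io

# read() gives one long string; readlines() gives a list of lines.
sample = io.StringIO('apple\nbat\n')
print(sample.read()[2])       # 'p' -- the third character of the whole file

sample = io.StringIO('apple\nbat\n')
print(sample.readlines()[1])  # 'bat\n' -- the second element of the list
```

Indexing the string gives characters; indexing the list gives whole lines, which is what the question was after.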
Instructions:
If we don't know how many items are in a file, we can use read() to load the entire file and then use the line endings (the \n bits) to split it into lines. Here is an example of how to use split()
source = 'aaa,bbb,ccc'
things = source.split(',') # split at every comma
print(things) # displays ['aaa', 'bbb', 'ccc'] because things is a list
Task
Ask the user to enter names and keep asking until they enter nothing.
Add each new name to a file called names.txt as they are entered.
Hint: Open the file before the loop and close it after the loop
Once they have stopped entering names, load the file contents, split it into individual lines and print the lines one by one with -= before the name and =- after it.
Assuming you already have the file of names called input.txt like so:
Tom
Dick
Harry
Then the code to read the entire file, split it into lines, and print each name in the required format is:
# Open the file
with open("input.txt") as file:
    # Read the entire file
    content = file.read()

# Split the content up into individual names
# (splitlines avoids a trailing empty entry if the file ends with a newline)
names = content.splitlines()

# Print the required string for each name
for name in names:
    print(f'-={name}=-')
Hope that solves your problem.
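The input-gathering half of the task can be sketched separately. The prompt text and function names below are my own, and the loop takes its ask function as a parameter so replies can come from any source:

```python
# Keep asking for names until a blank reply, then format each one.
def collect_names(ask=input):
    names = []
    while True:
        name = ask('Enter a name (blank to stop): ')
        if not name:
            break
        names.append(name)
    return names

def format_names(names):
    # Wrap each name in the -= ... =- markers from the task
    return [f'-={name}=-' for name in names]

print(format_names(['Tom', 'Dick', 'Harry']))
```

In the full task you would write the collected names to names.txt before the formatting step, per the hint about opening the file before the loop.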
Issue: Remove the hyperlinks, numbers and signs like ^&*$ etc from twitter text. The tweet file is in CSV tabulated format as shown below:
s.No. username tweetText
1. #abc This is a test #abc example.com
2. #bcd This is another test #bcd example.com
Being a novice at Python, I searched and strung together the following code, thanks to the code given here:
import re

fileName = "path-to-file//tweetfile.csv"
fileout = open("Output.txt", "w")
with open(fileName, 'r') as myfile:
    data = myfile.read().lower()  # read the file and convert all text to lowercase
    clean_data = ' '.join(re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", data).split())  # regular expression to strip the html out of the text
    fileout.write(clean_data + '\n')  # write the cleaned data to a file
fileout.close()
myfile.close()
print("All done")
It does the data stripping, but the output file format is not as I desire. The output text file is in a single line like
s.no username tweetText 1 abc This is a cleaned tweet 2 bcd This is another cleaned tweet 3 efg This is yet another cleaned tweet
How can I fix this code to give me an output like given below:
s.No. username tweetText
1 abc This is a test
2 bcd This is another test
3 efg This is yet another test
I think something needs to be added in the regular expression code but I don't know what it could be. Any pointers or suggestions will be helpful.
You can read the line, clean it, and write it out in one loop. You can also use the CSV module to help you build out your result file.
import csv
import re

exp = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def cleaner(row):
    return [re.sub(exp, " ", item.lower()) for item in row]

with open('input.csv', 'r') as i, open('output.csv', 'w', newline='') as o:
    reader = csv.reader(i, delimiter=',')  # Comma is the default
    writer = csv.writer(o, delimiter=',')
    # Take the first row from the input file (the header)
    # and write it to the output file
    writer.writerow(next(reader))
    for row in reader:
        writer.writerow(cleaner(row))
The csv module knows how to add separators between items correctly, as long as you pass it a collection of items.
So what the cleaner method does is take each item (column) in the row from the input file, apply the substitution to the lowercase version of the item, and then return a list.
The rest of the code simply opens the files and configures the CSV module with the separators you want for the input and output files (in the example code, the separator for both files is a comma, but you can change the output separator).
Next, the first row of the input file is read and written out to the output file. No transformation is done on this row (which is why it is not in the loop).
Reading the row from the input file automatically puts the file pointer on the next row - so then we simply loop through the input rows (in reader), for each row apply the cleaner function - this will return a list - and then write that list back to the output file with writer.writerow().
Instead of applying the re.sub() and .lower() expressions to the entire file at once, try iterating over each line of the CSV file, like this:
for line in myfile:
    line = line.lower()
    line = re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", line)
    fileout.write(line + '\n')
Also, when you use the with <file> as myfile expression, there is no need to close the file at the end of your program; this is done automatically when the with block ends.
Try this regex:
clean_data = ' '.join(re.sub(r"#\S+|[#\^&\*\$]|\S+[a-z0-9]\.(com|net|org)", " ", data).split())  # strip hashtags, stray symbols and domain names out of the text
Explanation:
#\S+ matches hashtags (it is listed first so the whole tag is removed, not just the leading #)
[#\^&\*\$] matches the individual characters you want to replace
\S+[a-z0-9]\.(com|net|org) matches domain names
If the URLs can't be identified by an https?:// prefix, you'll have to complete the list of potential TLDs.
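For a quick check, here is the pattern applied to a made-up sample tweet; note the hashtag alternative comes first in the alternation so '#python' is removed whole, not just its leading '#':

```python
import re

# Alternation order matters: re tries alternatives left to right.
pattern = r"#\S+|[#\^&\*\$]|\S+[a-z0-9]\.(com|net|org)"
tweet = "Check out example.com #python *now*"
cleaned = ' '.join(re.sub(pattern, " ", tweet).split())
print(cleaned)  # Check out now
```

The domain name, the whole hashtag, and the stray asterisks are all replaced, and the join/split pair collapses the leftover whitespace.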
I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
fh.close()

clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list object that contains one line of required text per element. You could split each line into individual words using str.split(), which would give you a nested list that you can easily index (assuming your text has whitespace, of course).
words = [line.split() for line in clean_text]
words[4][7]
would return the 8th word of the 5th line.
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []
with open("yourfile.txt", "r") as f:  # your .txt file, "r" = read
    for line in f:
        if line[:2] != "##":  # check the first two characters
            listoflines.append(line)
print(listoflines)
If you're feeling brave, you can also do the following (credits go to Alex Thornton):
with open("yourfile.txt") as f:
    listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially for teaching the .startswith function, but I think this is the more pythonic way, and using with also has the advantage of automatically closing the file as soon as you're done with it.
I am new to python so excuse my ignorance.
Currently, I have a text file with some words marked as <<word>>.
My goal is to essentially build a script which runs through a text file with such marked words. Each time the script finds such a word, it would ask the user for what it wants to replace it with.
For example, if I had a text file:
Today was a <<feeling>> day.
The script would run through the text file so the output would be:
Running script...
feeling? great
Script finished.
And generate a text file which would say:
Today was a great day.
Advice?
Edit: Thanks for the great advice! I have made a script that works for the most part like I wanted. Just one thing: now I am working on the case where I have multiple variables with the same name (for instance, "I am <<feeling>>. Bob is also <<feeling>>.") so that the script only prompts feeling? once and fills in all the variables with the same name.
Thanks so much for your help again.
import re

with open('in.txt') as infile:
    text = infile.read()
search = re.compile('<<([^>]*)>>')
text = search.sub(lambda m: input(m.group(1) + '? '), text)
with open('out.txt', 'w') as outfile:
    outfile.write(text)
Basically the same solution as that offered by phihag, but in script form:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import re

pattern = '<<([^>]*)>>'

def user_replace(match):
    return input('%s? ' % match.group(1))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', type=argparse.FileType('r'))
    parser.add_argument('outfile', type=argparse.FileType('w'))
    args = parser.parse_args()
    matcher = re.compile(pattern)
    for line in args.infile:
        new_line = matcher.sub(user_replace, line)
        args.outfile.write(new_line)
    args.infile.close()
    args.outfile.close()

if __name__ == '__main__':
    main()
Usage: python script.py input.txt output.txt
Note that this script does not account for non-ASCII file encodings.
To open a file and loop through it:
Use input() to get input from the user.
Now, put this together and update your question if you run into problems :-)
I understand you want advice on how to structure your script, right? Here's what I would do:
Read the file at once and close it (I personally don't like to keep open file objects around, especially if my filesystem is remote).
Use a regular expression (phihag has suggested one in his answer, so I won't repeat it) to match the pattern of your placeholders. Find all of your placeholders and store them in a dictionary as keys.
For each word in the dictionary, ask the user for a replacement with input(), and store the replies as values in the dictionary.
When done, parse your text, substituting any instance of a given placeholder (key) with the user's word (value). This is also done with regex.
The reason for using a dictionary is that a given placeholder could occur more than once and you probably don't want to make the user repeat the entry over and over again...
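A minimal sketch of that dictionary idea, assuming <<name>> placeholders as in the question (the function and variable names are my own). The prompt source is a parameter so the example call below can supply a canned reply instead of reading stdin:

```python
import re

PLACEHOLDER = re.compile(r'<<([^>]*)>>')

def fill_placeholders(text, ask=input):
    # Prompt once per unique placeholder name, then reuse the reply.
    replacements = {}
    def lookup(match):
        name = match.group(1)
        if name not in replacements:
            replacements[name] = ask(name + '? ')
        return replacements[name]
    return PLACEHOLDER.sub(lookup, text)

print(fill_placeholders('I am <<feeling>>. Bob is also <<feeling>>.',
                        ask=lambda prompt: 'great'))
# I am great. Bob is also great.
```

Because repeated placeholders hit the cache, the user is asked only once per name, which is exactly the behavior requested in the question's edit.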
Try something like this
lines = []
with open(myfile, "r") as infile:
    lines = infile.readlines()

outlines = []
for line in lines:
    index = line.find("<<")
    if index >= 0:
        word = line[index + 2:line.find(">>")]
        answer = input(word + "? ")
        outlines.append(line.replace("<<" + word + ">>", answer))
    else:
        outlines.append(line)

with open(outfile, "w") as output:
    for line in outlines:
        output.write(line)
Disclaimer: I haven't actually run this, so it might not work, but it looks about right and is similar to something I've done in the past.
How it works:
It reads the file in as a list where each element is one line of the file.
It builds the output list of lines. It iterates through the lines in the input, checking if the string << exists. If it does, it rips out the word inside the << and >> brackets and uses it as the question for an input() prompt. It takes the reply from that prompt and replaces the value inside the arrows (and the arrows themselves) with it. It then appends this value to the list. If it didn't see the arrows, it simply appends the line unchanged.
After running through all the lines, it writes them to the output file. You can make this whatever file you want.
Some issues:
As written, this will work for only one arrow statement per line, so if you had <<firstname>> <<lastname>> on the same line it would ignore the lastname portion. Fixing this wouldn't be too hard to implement: you could turn the if index >= 0 check into a while loop and keep the replacement logic inside it. Just remember to update the index again each time through the loop!
It iterates through the lines several times. You could likely reduce this, but for a small text file it shouldn't be a huge problem.
It could be sensitive to encoding; I'm not entirely sure about that, however. Worst case, you need to cast to a string.
Edit: Moved the +2 to fix the broken if statement.
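The while-loop fix for the first issue might look like this; the function name is my own, and the prompt source is a parameter so the replacement logic can be exercised with canned replies:

```python
# Replace every <<...>> marker on a line, prompting once per marker.
def replace_all(line, ask=input):
    index = line.find("<<")
    while index >= 0:
        end = line.find(">>", index)
        word = line[index + 2:end]
        reply = ask(word + "? ")
        # Splice the reply in place of the marker, then look again
        line = line[:index] + reply + line[end + 2:]
        index = line.find("<<")
    return line
```

Called on a line like '<<firstname>> <<lastname>>', it prompts for each marker in turn instead of stopping after the first.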