Best way to parse a file in Python

I have a text file that I need to read, identify some parts to change, and write to a new file. Here's a snippet similar to what the text file (which is about 600 lines long) looks like:
<REAPER_PROJECT 0.1 "4.731/x64" 1431724762
RIPPLE 0
RECORD_PATH "Audio" ""
<RECORD_CFG
ZXZhdxgA
>
<APPLYFX_CFG
>
LOCK 1
<METRONOME 6 2
VOL 0.25 0.125
FREQ 800 1600 1
BEATLEN 4
SAMPLES "" ""
>
>
So, for example, I'd need to change "LOCK 1" to "LOCK 0". Right now I'm reading the file line by line, looking for when I hit the "LOCK" keyword and then instead of writing "LOCK 1", I write "LOCK 0" (all other lines are written as is). Pretty straightforward.
Part of this seems kinda messy to me, though: sometimes when I have to use nested for loops to parse a sub-section of the text file, I run into weirdness with file-pointer off-by-one errors. Not a biggie and manageable, but I was kinda looking for some opinions on this. Instead, I was wondering if it would make more sense to read the entire file into a list, parse through the list looking for keywords to change, update those specific lines in the list, and then write the whole list to the new file. It seems like I would have a bit more control over things, as I wouldn't have to process the file in the linear fashion I'm kinda forced into now.
So, I guess the last sentence kinda justified why it could be advantageous to pull it all into a list, process the list, and then write it out. I'm kinda curious how others with more programming experience (as mine is somewhat limited) would tackle this kind of issue. Any other ways that would prove even more efficient?
Btw, I didn't generate this file - other software did, and I don't have any communication with the developer so I have no way of knowing what they're using to read/write the file. I'd absolutely love it if I had a neat reader that could read the file and populate it into variables and then rewrite it out, but for me to code something that would do that would be overkill for what I'm trying to accomplish.
I'm kinda tempted to rewrite my script to read it into a list as it seems like it would be a better way to go, but I thought I'd ask people what they thought before I did. My version works, but I don't mind going through the motions, either, as it's a good lesson regardless. I figured this could also be a case where there are always different ways to tackle a problem, but I'd like to try and be as efficient as possible.
UPDATE
So, I probably should have mentioned this, but I was still trying to figure out what to ask - while I need to find certain elements and change them, I can only find those elements by finding their header (i.e. "ITEM") and then replacing the element within the block. So it'll be something like this:
<METRONOME
NAME Clicky
SPEED fast
>
<ITEM
LOOP 0
NAME Mike
FILE something.wav
..
>
<ITEM
LOOP 1
NAME Joe
FILE anotherfile.wav
..
>
So the only way to identify the correct block of data is to first find the ITEM header, then keep reading until I find the NAME element, and then update the file name for that whole ITEM block. There are other elements within that block that I need to update, and the NAME element isn't the first item in the block. Also, I can't assume that the NAME element exists only in ITEM blocks.
So maybe this really has less to do with reading it into memory and more with how to properly parse this type of file? Or are there benefits to reading it into memory that make it easier to manipulate? Sorry I didn't clarify that in the original question...
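In other words, I think what I need is some kind of block tracking. A rough sketch of that idea (the file names and the replacement value here are hypothetical, just to illustrate the shape of it):
# Track which block we're in with a stack of headers, and only rewrite
# NAME lines while inside an <ITEM ...> block.
block_stack = []
with open("project.txt") as src, open("project_new.txt", "w") as dst:
    for line in src:
        stripped = line.strip()
        if stripped.startswith("<"):
            block_stack.append(stripped.split()[0].lstrip("<"))  # e.g. "ITEM"
        elif stripped == ">":
            block_stack.pop()
        elif block_stack and block_stack[-1] == "ITEM" and stripped.startswith("NAME "):
            line = line.replace(stripped, 'NAME "NewName"')  # hypothetical new value
        dst.write(line)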

If it has only ~600 lines, you can take it all into memory:
replacements = [('LOCK 1', 'LOCK 0')]  # add more (old, new) pairs as needed
with open('read.txt') as r:
    read = r.read()
for old, new in replacements:
    read = read.replace(old, new)  # str.replace returns a new string, so reassign
with open('write.txt', 'w') as w:
    w.write(read)
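One caveat: str.replace changes every occurrence in the whole file. That's fine for a line like LOCK 1 that appears only once, but it can't target a single block, so for the per-block edits described in the update you'd still need to track the current block while walking the lines, as in the sketch under the question.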

Here's my answer using regex:
import re
text = """<REAPER_PROJECT 0.1 "4.731/x64" 1431724762
RIPPLE 0
RECORD_PATH "Audio" ""
<RECORD_CFG
ZXZhdxgA
>
<APPLYFX_CFG
>
LOCK 1
<METRONOME 6 2
VOL 0.25 0.125
FREQ 800 1600 1
BEATLEN 4
SAMPLES "" ""
>
>
"""
print(re.sub(r"LOCK 1\D", "LOCK 0\n", text))
If you want to write the result to disk:
with open("written.txt", 'w') as f:
    f.write(re.sub(r"LOCK 1\D", "LOCK 0\n", text))
EDIT
You said that you wanted it to be more flexible?
Okay, I tried to make an example, but for that I would need more information about your setup, etc. So instead, I'll point you to some resources that could help you. This will also be good if you ever want to change or add anything, as you'll then understand what to do.
https://www.youtube.com/watch?v=DRR9fOXkfRE  # how regex works for Python in general
https://regexone.com/references/python  # some information about regex and Python
https://stackoverflow.com/a/5658439/4837005  # an example of using regex to replace a string
I hope this helps.

Related

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work. split will return a list which will always have at least 1 entry (the full string). Using index may throw an exception.
You can also limit the # of splits that will happen if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
Ok then, notice that you can use indexing for strings just like you do for lists. I.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot we could just get what you want like that. For exactly that, strings have the index method. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
In case we are talking about a file containing multiple lines like that, you could do it for each line: split each line at the commas and put everything in a dictionary.
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        csv_line = line[:line.index(".")]
        for j, col in enumerate(csv_line.split(",")):
            data[(i, j)] = col
How one would do this
Notice that most people would not want to do it by hand. Working with tabular data is a common task, and there is a library called pandas for that. Maybe it would be a good idea to familiarise yourself a bit more with Python before you dive into pandas, though. I think a good point to start is this. Using pandas, your task would look like this:
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.
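One caveat to check (an assumption about the real data): the sample row looks space-separated rather than comma-separated, so you may also need to pass a separator explicitly, something like:
import pandas as pd
# Hypothetical variant: whitespace-separated fields, no header row,
# still dropping everything after the "." via comment=".".
df = pd.read_csv("./test.txt", comment=".", sep=r"\s+", header=None)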

How do I write the scores of my game into the leaderboard.txt and display the top 5 scores only with each player's name

I'm currently doing a dice game for my school programming project, and it includes the rules: 'Stores the winner’s score, and their name, in an external file' and 'Displays the score and player name of the top 5 winning scores from the external file.' I've done everything in my game apart from the leaderboard. I am able to write the name and score of the user into the txt file, but I am unsure how to then sort it. I would also like it so that when people first start the program they can use the menu to go to the leaderboard, where it would read the txt file and print the top 5 scores in order, including the names.
I've checked loads of other questions similar to mine, but none of them quite worked for my code: I kept getting errors implementing other people's code into mine because it just wasn't compatible with my layout.
Thanks in advance, I've never used stack overflow to ask a question so I apologize if there's anything I've done wrong in my post.
You did good on the question. You stated the problem clearly and, most importantly, you added enough code for us to run it and see how the program behaves and what's going wrong. In this case nothing is going wrong, which is good :)
Considering you mention that this is a school project I will not give you a fully copy/paste solution but will explain hopefully enough details on how to solve this on your own.
Now according to the question, you don't know how to sort your leaderboard. I ran the program a few times myself (after removing the sleeps because I am impatient 😋) and saw that your leaderboard file looks like this:
90 - somename
38 - anothername
48 - yetanothername
To display this you must do two things:
Open the file and read the data
Convert the data from the file into something usable by the program
The first step seems to be something you already know as you already use open() to write into the file. Reading is very similar.
The next step is not so obvious if you are new to programming. The file is read as text-data, and you need to sort it by numbers. For a computer the text "10" is not the same as the number 10 (note the quotes). You can try this by opening a Python shell:
>>> 10 == 10
True
>>> 10 == "10"
False
>>> "10" == 10
False
And text sorts differently to numbers. So one piece of the solution is to convert the text into numbers.
You will also get the data as lines (either using readlines() or splitlines(), depending on how you use it). These lines need to be split into score and name. The pattern in the file is this:
<score> - <name>
It is important to notice that you have the text " - " as separator between the two (including spaces). Have a look at the Python functions str.split() and str.partition(). These functions can be applied to any text value:
>>> "hello.world".split(".")
['hello', 'world']
>>> "hello.world".partition(".")
('hello', '.', 'world')
You can use this to "cut" the line into multiple pieces.
After doing that you have to remember the previous point about converting text to numbers.
As a last step you will need to sort the values.
When reading from the file, you can load the converted data into a Python list, which can then be sorted.
A convenient solution is to create a list where each element of that list is a tuple with the fields (score, name). Like that you can directly sort the list without any arcane tricks.
And finally, after sorting it, you can print it to the screen.
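To illustrate just that last piece (deliberately not the whole program), here is a minimal sketch of the parse-and-sort step, assuming the leaderboard.txt format shown above:
# Minimal sketch: parse "score - name" lines, convert, sort, print the top 5.
leaderboard = []
with open("leaderboard.txt") as f:
    for line in f:
        score, _, name = line.strip().partition(" - ")
        leaderboard.append((int(score), name))  # convert the text score to a number
leaderboard.sort(reverse=True)  # highest score first
for score, name in leaderboard[:5]:
    print(score, "-", name)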
In summary
Open the file
Read the data from the file as "lines"
Create a new, empty list.
Loop over each line and...
... split the line into multiple parts to get at the score and name separately
... convert the score into a number
... append the two values to the new list from point 3
Sort the list from point 3
Print out the list.
Some general thoughts
You can improve and simplify the code by using more functions
You already show that you know how to use functions. But look at the comments #THIS IS ROUND1 to #THIS IS ROUND5. The lines of code for each round are the same. By moving those lines into a function you will save a lot of code. And that has two benefits: You will only need to make a code-change (improvement or fix) in one place. And secondly, you guarantee that all blocks behave the same.
To do this, you need to think about what variables that block needs (those will be the new function arguments) and what the result will be (that will be the function return value).
A simple example with duplication:
from random import randint

print("round 1")
outcomes = []
value1 = randint(1, 100)
value2 = randint(1, 100)
if value1 > value2:
    outcomes.append("A")
else:
    outcomes.append("B")

print("round 2")
value1 = randint(1, 100)
value2 = randint(1, 100)
if value1 > value2:
    outcomes.append("A")
else:
    outcomes.append("B")
Rewritten with functions:
from random import randint

def run_round(round_name):
    print(round_name)
    value1 = randint(1, 100)
    value2 = randint(1, 100)
    if value1 > value2:
        return "A"
    else:
        return "B"

outcomes = []
result_1 = run_round("round 1")
outcomes.append(result_1)
result_2 = run_round("round 2")
outcomes.append(result_2)
As you can see, the second code is much shorter and has no more duplication. Your code will have more function arguments. It is generally a challenge in programming to organise your code in such a way that functions have few arguments and no complex return values. Although, as long as it works nobody will look too closely ;)
Safe way to ask for a password
You can use getpass() from the getpass module to prompt for a password in a secure manner:
from getpass import getpass
password = getpass()
Note however, if you are using PyCharm, this causes some issues which are out of scope of this post. In that case, stick with input().
Sleeps
The "sleep()" calls are nice and give you the chance to follow the program, but make it slow to test the program. Consider to use smaller values (comma-values are possible), or, even better, write your own function that you can "short-circuit" for testing. Something like this:
import time

ENABLE_SLEEP = True

def sleep(s):
    if ENABLE_SLEEP:
        time.sleep(s)

print("some code")
sleep(1)
print("more code")
sleep(4)
You will then use your own sleep() function anytime you want to wait. That way, you can simply set the variable ENABLE_SLEEP to False and your code will run fast (for testing).

check csv every 5 rows with condition using python3.x

csv data:
c1,v1,c2,v2,Time
13.9,412.1,29.7,177.2,14:42:01
13.9,412.1,29.7,177.2,14:42:02
13.9,412.1,29.7,177.2,14:42:03
13.9,412.1,29.7,177.2,14:42:04
13.9,412.1,29.7,177.2,14:42:05
0.1,415.1,1.3,-0.9,14:42:06
0.1,408.5,1.2,-0.9,14:42:07
13.9,412.1,29.7,177.2,14:42:08
0.1,413.4,1.3,-0.9,14:42:09
0.1,413.8,1.3,-0.9,14:42:10
My current code that I have:
import pandas as pd
import csv
import datetime as dt
#Read .csv file, get timestamp and split it into date and time separately
Data = pd.read_csv('filedata.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
Data['Date'] = Data.Time_Stamp.dt.date
Data['Time'] = Data.Time_Stamp.dt.time
#print (Data)
print (Data['Time_Stamp'])
Data['Time_Stamp'] = pd.to_datetime(Data['Time_Stamp'])
#Read timestamp within a certain range
mask = (Data['Time_Stamp'] > '2017-06-12 10:48:00') & (Data['Time_Stamp']<= '2017-06-12 11:48:00')
june13 = Data.loc[mask]
#print (june13)
What I'm trying to do is read the data 5 seconds at a time, and if one of the 5 values of c1 is 10.0 or above, replace that value of c1 with 0.
I'm still new to python and I could not find examples for this. May I have some assistance as this problem is way beyond my python programming skills for now. Thank you!
I don't know the modules around csv files so my answer might look primitive, and I'm not quite sure what you are trying to accomplish here, but have you thought of dealing with the file textually?
From what I get, you want to read every c1, check the value and modify it.
To read and modify the file, you could do:
with open('filedata.csv', 'r+') as csv_file:
    lines = csv_file.readlines()
    # For each line, isolate the data part and check - and modify - the
    # first value if needed.
    # I'm seriously not sure, you might have wanted to read only one out of
    # five lines. For that, just do a while loop with an index, which
    # increments through lines by 5.
    for i, line in enumerate(lines):
        fields = line.split(',')  # split comma-separated values
        # Check the condition and apply the needed change (the header line
        # won't parse as a float, so skip it).
        try:
            if float(fields[0]) >= 10:
                fields[0] = "0"  # directly as a string
        except ValueError:
            continue
        # Transform the list back into a single string and store it back.
        lines[i] = ",".join(fields)
    # Rewrite the file.
    csv_file.seek(0)
    csv_file.writelines(lines)
    csv_file.truncate()  # the new content may be shorter than the old
# Here you are ready to use the file just like you were already doing.
# Of course, the above code could be put in a function for known advantages.
(I don't have python here, so I couldn't test it and typos might be there.)
If you only need the dataframe without the file being modified:
Pretty much the same to be honest.
Instead of the file-writing at the end, you could do:
from io import StringIO  # pandas needs a file-like object instead of a plain string
# Above code here, but without the rewrite-the-file part at the end.
Data = pd.read_csv(
    StringIO("".join(lines)),  # the lines still end with "\n"
    parse_dates=['Time_Stamp'],
    infer_datetime_format=True
)
This should give you the Data you have, with changed values where needed.
Hope this wasn't completely off. Also, some people might find this approach horrible; we have already coded working modules to do that kind of thing, so why bother dealing with the rough raw data ourselves? Personally, I think it's often much easier to understand how the text representation of a file can be used than to learn every external module I'll ever need. Your opinion might differ.
Also, this code might result in lower performance, as we need to iterate through the text twice (pandas does it when reading). However, I don't think you'd get a faster result by reading the csv like you already do and then iterating through the data anyway to check the condition. (You might save a cast per checked c1 value, but the difference is small, and iterating through a pandas dataframe might as well be slower than a list, depending on the state of their current optimisation.)
Of course, if you don't really need the pandas dataframe format, you could do it completely manually; it would take only a few more lines (or not, tbh) and shouldn't be slower, as the amount of iteration would be minimized: you could check conditions on the data at the same time as you read it. It's getting late and I'm sure you can figure that out by yourself, so I won't code it in my great editor (known as stackoverflow), ask if there's anything!
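That said, a minimal single-pass sketch (assuming the c1,v1,c2,v2,Time layout from the question) could look like this:
rows = []
with open('filedata.csv') as csv_file:
    header = next(csv_file).strip().split(',')  # skip the header row
    for line in csv_file:
        fields = line.strip().split(',')
        if float(fields[0]) >= 10:  # check c1 at the same time as reading
            fields[0] = '0'
        rows.append(fields)
# rows now holds the corrected data, ready for whatever comes next.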

Reading text and assigning a class to data in Python

I've been searching around, and had no luck finding anything answering my question.
Essentially I have a file with the following data:
Title - 19
Artist - Adele
Year released - 2008
1 - Daydreamer, 3:41, 1
2 - Best for Last, 4:19, 5
3 - Chasing Pavements, 3:31, 7
4 - Cold Shoulder, 3:12, 3
Title - El Camino
Artist - The Black Keys
Year released - 2011
1 - Lonely Boy, 3:13, 1
2 - Run Right Back, 3:17, 10
EOF
I know how to create classes, and how to assign an object to a class and values to that object, but I am just about ready to tear my hair out over how I'm supposed to process the text. From the text, I need to create a title for the album and assign the album's information to it. There's more besides that which needs to be done, and there are more lines to be read, and I just don't know where to start on this. I've found two "album.py" files via google, and I've been unable to make heads or tails of how to apply the solution to my case.
And yes, this is for a school assignment. I've done some digging around and found some things relevant, but I'm just not understanding it. I'm new to programming in general, and I've made progress but this seems too far over my head.
I know I could reduce this to lists using split("\n\n") and operating on a series of progressively smaller lists, but I am trying to avoid this method at all costs.
EDIT:
For the time being, it's best to assume I know nothing. Though, to answer the question below: I can open the file and read it. If it's a consistently CSV-formatted file, I can write code to process the enclosed data and create a class structure that uses that data. Right now I'm just having trouble with the first three lines, and the digits immediately below.
APRIL 4 2012:
Okay, I have some code, I've left the comments with respect to it underneath.
def getInput():
    global albums
    raw = open("album.txt","r")
    infile = raw
    raw.close
    text=""
    line = infile.readline()
    while (line != "EOF\n" ):
        text += line
        line=infile.readline()
    text=text.rstrip("\n\n")
    albums=[str(n) for n in text.split("\n\n")]
    return albums

class Album():
    def __init__(self, title, artist, date):
        self.title=title
        self.artist=artist
        self.date=date
        self.track={}
    def addSong(self, TrackID, title, time, ranking):
        self.track+={self}
    def getAlbumLength(self):
        asdf=0
    def getRanking(self):
        asdf=0

def labels(x): #establishes labels per item to be used for Album Classifier
    title=""
    artist=""
    date=""
    for i in range(0,len(albums),1):
        sublist=[str(n) for n in albums[i].split("\n")]
        RANDUMB=len(albums[i])
        title=sublist[0]
        artist=sublist[1]
        date=sublist[2]
        for j in range(0,len(sublist),1):
            song_info = [str(k) for k in sublist[3:].split("," and " - ")]
            TrackID=song_info[0]
            title=song_info[1]
            time=song_info[2]
            ranking=song_info[3]

getInput()
labels(albums)
Personal comments on code:
I was trying to avoid getting it into lists because I anticipated this problem. As far as the functions are concerned, I have to use every single bloody one, because it's in the assignment requirements... I am displeased because I could probably get around using them. The code is working sufficiently enough, except for the last part, where I am trying to take the song information. I want to split the song information into lists, which are nested into the album information list. Something like:
[Album title, Artist, Date released,[01,Song,3:44,2],[02,Song,0:01,9]....]
The current code gives me an index-out-of-range error right now... I am using python3.
TLDR: The substance of my problem has thus changed from one of trying to solve how to go about starting the solution to how to take items in a list and convert them into nested lists.
If you end up editing your question to contain some more specific examples of what is giving you trouble, I will edit this answer. But to address your general question, there are some steps involved in achieving your goal.
Like you said, you need to write a class that reflects the structure you intend to have from this data.
You will need to parse this file, probably line by line. So you have to determine whether this file format is consistent. If it is, then you need to determine:
What is the delimiter between each set of data, which will be conformed into a class instance?
What is the delimiter between each field of each line?
When you are looping over each line, you will know that you need to start a new album object whenever you encounter a blank line.
When you know you are starting a new album, you can assume that the first line will be a title, the second an artist, the third, the year, etc.
For each of these lines you will also have to have rules of how to split each one into the data you want. At a basic level it can be a simple set of splits. At a more advanced level you might define regular expressions for each type of line.
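Putting those steps together, here is a minimal sketch of that recipe, assuming the exact layout from the question (blank lines between albums, an "EOF" sentinel, and "Field - value" lines); the Album class here is a simplified stand-in for your own:
class Album:
    def __init__(self, title, artist, year):
        self.title = title
        self.artist = artist
        self.year = year
        self.tracks = []  # (number, name, length, ranking) tuples

def build_album(block):
    # First three lines: "Title - ...", "Artist - ...", "Year released - ..."
    title = block[0].partition(" - ")[2]
    artist = block[1].partition(" - ")[2]
    year = block[2].partition(" - ")[2]
    album = Album(title, artist, year)
    # Remaining lines look like "1 - Daydreamer, 3:41, 1"
    for line in block[3:]:
        number, _, rest = line.partition(" - ")
        name, length, ranking = [part.strip() for part in rest.split(",")]
        album.tracks.append((int(number), name, length, int(ranking)))
    return album

albums = []
block = []
with open("album.txt") as f:
    for line in f:
        line = line.rstrip("\n")
        if line == "EOF":
            break
        if line == "":  # a blank line ends the current album
            if block:
                albums.append(build_album(block))
                block = []
        else:
            block.append(line)
if block:  # the last album may not be followed by a blank line
    albums.append(build_album(block))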

Most efficient way in Python to iterate over a large file (10GB+)

I'm working on a Python script to go through two files - one containing a list of UUIDs, the other containing a large number of log entries, each line containing one of the UUIDs from the other file. The purpose of the program is to create a list of the UUIDs from file1 and then increment the associated value each time that UUID is found in the log file.
So long story short, count how many times each UUID appears in the log file.
At the moment, I have a list which is populated with UUID as the key and 'hits' as the value. Then another loop iterates over each line of the log file, checking whether the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle): #start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize): #check progress
        print logFunc.progress(lineCount, logSize) #print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1: #for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1 #if matched, increment the relevant value in the uidHits list
            break #as we've already found the match, don't process the rest
    lineCount += 1
It works as it should - but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading files in chunks rather than line by line would improve performance by reducing the amount of disk I/O time, but the performance difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map, which is lazy; on Python 2 use itertools.imap to avoid building the whole list in memory.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
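For example, a minimal sketch of such a uuid function (the field position is an assumption about the log format; adjust the split for the real layout):
import collections

def uuid(line):
    return line.split()[0]  # assumes the UUID is the first whitespace-separated field

with open("log.txt") as log:
    counts = collections.Counter(uuid(line) for line in log)
print(counts.most_common(10))  # the ten most frequent UUIDs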
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure if you'll see a performance gain, since I've not processed 10GB of data with it, but you might explore this framework.
Try measuring where most time is spent, using a profiler http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: If the list of uuids isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)". If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable, and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
