Parsing and arranging text in Python

I'm having some trouble figuring out the best implementation.
I have data in a file in this format:
|serial #|machine_name|machine_owner|
If a machine_owner has multiple machines, I'd like the machines displayed as a comma-separated list in that field, so that this input:
|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble|mach3|
|3838|Barney Rubble|mach4|
|1212|Betty Rubble|mach5|
looks like this:
|Fred Flinstone|mach1|
|Barney Rubble|mach2,mach3,mach4|
|Betty Rubble|mach5|
Any hints on how to approach this would be appreciated.

You can use a dict as a temporary container to group by name and then print it in the desired format:
import re

s = """|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble||mach3|
|3838|Barney Rubble||mach4|
|1212|Betty Rubble|mach5|"""

results = {}
for line in s.splitlines():
    _, name, mach = re.split(r"\|+", line.strip("|"))
    if name in results:
        results[name].append(mach)
    else:
        results[name] = [mach]

for name, mach in results.items():
    print(f"|{name}|{','.join(mach)}|")

You need to store all the owner names in a list, and every time you are about to append a name, check that it is not already in the list so it doesn't get added twice.
After storing them in a list called data, iterate over the names and use
data[i].append([])
to add an empty list after each owner name stored at the i'th position.
Once you're done, iterate over the names again, find them in the file, and append the machine names.
All of this can be done in two passes; a rough sketch follows.
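A minimal sketch of that two-pass idea, assuming the question's pipe-delimited lines are available in a string s (the variable names here are only illustrative):
s = """|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble|mach3|
|3838|Barney Rubble|mach4|
|1212|Betty Rubble|mach5|"""

# First pass: collect each owner once, paired with an empty machine list.
data = []
for line in s.splitlines():
    _, owner, machine = line.strip("|").split("|")
    if not any(row[0] == owner for row in data):
        data.append([owner, []])

# Second pass: look each owner up again and append that line's machine.
for line in s.splitlines():
    _, owner, machine = line.strip("|").split("|")
    for row in data:
        if row[0] == owner:
            row[1].append(machine)

for owner, machines in data:
    print(f"|{owner}|{','.join(machines)}|")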

Related

How can I take a text file and create a triple nested list from it with tkinter python

I'm making a program that allows the user to log loot they receive from monsters in an MMO. I have the drop tables for each monster stored in text files. I've tried a few different formats but I still can't pin down exactly how to take that information into python and store it into a list of lists of lists.
The text file is formatted like this
item 1*4,5,8*ns
item 2*3*s
item 3*90,34*ns
The item # is the name of the item, the numbers are different quantities that can be dropped, and the s/ns is whether the item is stackable or not stackable in game.
I want the entire drop table of the monster to be stored in a list called currentDropTable so that I can reference the names and quantities of the items to pull photos and log the quantities dropped and stuff.
The list for the above example should look like this
[["item 1", ["4","5","8"], "ns"], ["item 2", ["2","3"], "s"], ["item 3", ["90","34"], "ns"]]
That way, I can reference currentDropTable[0][0] to get the name of an item, or if I want to log a drop of 4 of item 1, I can use currentDropTable[0][1][0].
I hope this makes sense, I've tried the following and it almost works, but I don't know what to add or change to get the result I want.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        currentDropTable.append(item)

dropTableFile = open("droptable.txt", "r").read().split('\n')
convert_drop_table(dropTableFile)
print(currentDropTable)
This prints everything properly except that the quantities are still a single string rather than a list, so it looks like
[['item 1', '4,5,8', 'ns'], ['item 2', '2,3', 's']...etc]
I've tried nesting another loop (for j in i, then split(',')), but that breaks up everything, not just the quantities.
I hope I was clear, if I need to clarify anything let me know. This is the first time I've posted on here, usually I can just find another solution from the past but I haven't been able to find anyone who is trying to do or doing what I want to do.
Thank you.
You want to split only the second element by ',', so you don't need another loop. Since you know that item = i.split('*') returns a list of 3 items, you can simply change your loop body as follows:
for i in list:
    item = i.split('*')
    item[1] = item[1].split(',')
    currentDropTable.append(item)
Here you replace the second element of item with a list of the quantities.
You only need to split the second element of that list.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        item[1] = item[1].split(',')
        currentDropTable.append(item)
The first thing I feel bound to say is that it's usually a good idea to avoid using global variables in any language. Errors involving them can be hard to track down. In fact you could simply omit that function convert_drop_table from your code and do what you need in-line. Then readers aren't obliged to look elsewhere to find out what it does.
And here's yet another way to parse those lines! :) Look for the asterisks then use their positions to select what you want.
currentDropTable = []
with open('droptable.txt') as droptable:
    for line in droptable:
        line = line.strip()
        p = line.find('*')
        q = line.rfind('*')
        # split the middle section on ',' so the quantities come out as a list
        currentDropTable.append([line[0:p], line[1+p:q].split(','), line[1+q:]])
print(currentDropTable)

Count and flag duplicates in a column in a csv

This type of question has been asked many times, so apologies; I have searched hard for an answer but have not found anything close enough to my needs (and as a total newbie I am not sufficiently advanced to customize an existing answer). So thanks in advance for any help.
Here's my query:
I have 30 or so csv files and each contains between 500 and 15,000 rows.
Within each of them (in the 1st column) - are rows of alphabetical IDs (some contain underscores and some also have numbers).
I don't care about the unique IDs - but I would like to identify the duplicate IDs and the number of times they appear in all the different csv files.
Ideally I'd like the output for each duped ID to appear in a new csv file and be listed in 2 columns ("ID", "times_seen")
It may be that I need to compile just 1 csv with all the IDs for your code to run properly - so please let me know if I need to do that
I am using python 2.7 (a crawling script that I run needs this version, apparently).
Thanks again
It seems the easiest way to achieve what you want would be to make use of dictionaries.
import csv
import os

# Assuming all your csv files are in a single directory, we will iterate on the
# files in this directory, selecting only those ending with .csv.
# To list the files in the directory we will use the walk function in the
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator);
# this generator yields tuples of the form (root_directory,
# list_of_directories, list_of_files).
# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# get the first values; as we won't recurse into subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = file_generator.next()
# Now we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        csv_list.append(f)
# That's what was contained in the line
# csv_list = [f for _, _, f in os.walk("/path/to/csv/dir").next() if f.endswith(".csv")]
# The dictionary (key value map) that will contain the id counts.
ref_count = {}
# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the file in read mode (joined with the directory it was found in)
    with open(os.path.join(root_dir, csv_file), "r") as _:
        # build a csv reader around the file
        csv_reader = csv.reader(_)
        # loop on all the lines of the file, transformed to lists by the
        # csv reader
        for row in csv_reader:
            # If we haven't encountered this id yet, create
            # the corresponding entry in the dictionary.
            if not row[0] in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1
# now write to the csv output
with open("youroutput.csv", "w") as _:
    writer = csv.writer(_)
    for k, v in ref_count.iteritems():
        # as requested we only take duplicates
        if v > 1:
            # use the writer to write the list to the file;
            # the delimiters will be added by it.
            writer.writerow([k, v])
You may need to tweak the csv reader and writer options a little to fit your needs, but this should do the trick. You'll find the documentation here: https://docs.python.org/2/library/csv.html. I haven't tested it though. Correcting the little mistakes that may have occurred is left as a practicing exercise :).
That's rather easy to achieve. It would look something like:
import os

# Set to whatever separator you have; '\t' for TAB
delimiter = ','
# Dictionary to keep count of ids
ids = {}
# Iterate over files in the current dir
for in_file in os.listdir(os.curdir):
    # Check whether it is a csv file (a dummy check, but it shall work for you)
    if in_file.endswith('.csv'):
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If the id does not exist in the dict, set its count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1
# save ids and counts to a file
with open('ids_counts.csv', 'w') as ofile:
    for key, val in ids.iteritems():
        # write down counts to the file using the same column delimiter
        ofile.write('{}{}{}\n'.format(key, delimiter, val))
Check out the pandas package. You can read and write csv files quite easily with it.
http://pandas.pydata.org/pandas-docs/stable/10min.html#csv
Then, once you have the csv content as a DataFrame, you can convert it with the as_matrix function.
Use the answers to this question to get the duplicates as a list.
Find and list duplicates in a list?
I hope this helps
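A minimal sketch of the pandas route, using value_counts rather than the linked list-based approach, and assuming the IDs sit in the first column of every .csv file in the current directory with no header row (the file pattern and output name are made up):
import glob
import pandas as pd

# Read the first column of every csv into one long Series of IDs.
ids = pd.concat(
    pd.read_csv(path, header=None, usecols=[0])[0]
    for path in glob.glob("*.csv")
)

# value_counts() gives each ID and how often it appears; keep only duplicates.
counts = ids.value_counts()
dupes = counts[counts > 1].rename_axis("ID").reset_index(name="times_seen")
dupes.to_csv("duplicate_ids.csv", index=False)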
As you are a newbie, I'll try to give some directions instead of posting an answer, mainly because this is not a "code this for me" platform.
Python has a library called csv that allows you to read data from CSV files (boom, surprised?). Start by reading the file (preferably an example file that you create with just 10 or so rows, and then increase the number of rows or use a for loop to iterate over different files). The examples at the bottom of the page I linked will help you print this info.
As you will see, the output you get from this library is a list with all the elements of each row. Your next step should be extracting just the ID that you are interested in.
The next logical step is counting the number of appearances. There is also a class in the standard library called Counter. It has a method called update that you can use as follows:
from collections import Counter
c = Counter()
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c.update(['fdf'])
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1
So basically you will have to pass it a list with the elements you want to count (you could have more than one id in the list, for example reading 10 IDs before inserting them, for improved efficiency, but remember not to construct a list of thousands of elements if you are after good memory behaviour).
If you try this and get into some trouble come back and we will help further.
Edit
Spoiler alert: I decided to give a full answer to the problem; please skip it if you want to find your own solution and learn Python in the process.
# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter

# Getting the names/paths of the files is not this question's goal,
# so I'll just have them in a list
files = [
    "file_1.csv",
    "file_2.csv",
]
# The output file name/path will also be stored in a variable
output = "output.csv"

# We create the object that is gonna count for us
appearances = Counter()

# Now we will loop over each file
for file in files:
    # We open the file in reading mode and get a handle
    with open(file, "r") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)
        # Here you may need to do something if your first row is a header
        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter
            appearances.update(row[:1])
            # row[:1] will get explained afterwards; it is the first column of the row in list form

# Now we will open/create the output file and get a handle
with open(output, "w") as file_h:
    # We create a csv writer for the handle, this time to write
    file_writer = writer(file_h)
    # If you want to insert a header into the output file, this is the place
    # We loop through our Counter object to write the rows:
    # here we have different options; if you want them sorted
    # by number of appearances, Counter.most_common() is your friend,
    # if you don't care about the order you can use the Counter object
    # as if it were a normal dict

    # Option 1: ordered
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (indices start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1, to finish as early as possible.
            break
        # As we have ended the loop if it appears once,
        # only duplicate IDs reach this point
        file_writer.writerow(id_and_times)

    # Option 2: unordered
    for id_and_times in appearances.iteritems():
        # This time we cannot stop the loop as they are unordered,
        # so we must check them all
        if id_and_times[1] > 1:
            file_writer.writerow(id_and_times)
I offered two options: printing them ordered (based on the Counter.most_common() docs) and unordered (based on the normal dict method dict.iteritems()). Choose one. From a speed point of view I'm not sure which would be faster, as the first needs to order the Counter but stops looping at the first non-duplicated element, while the second doesn't need to order the elements but must loop over every ID. The speed will probably depend on your data.
About the row[:1] thingy:
row is a list
You can get a subset of a list telling the initial and final positions
In this case the initial position is omitted, so it defaults to the start
The final position is 1, so just the first element gets selected
So the output is another list with just the first element
row[:1] == [row[0]] - they have the same output; getting a sublist containing only the first element is the same as constructing a new list with only the first element

Most efficient way to merge two partially inclusive lists of files and their properties

I have a system that runs a custom cli with a variation of the ls or dir command, returning a list of files and folders in your working directory.
The problem is, I can either run the command with a flag that returns the files and their time stamps (date created and last modified), or one that returns the files and their file sizes. There is no way to get both in a single cli command.
A further complication arises when getting the time-stamped list: only some of the files are returned (files with certain extensions are left out). Neither list is in any particular order.
I wish to create a dictionary that contains all the information for each file in one place. What is the cleanest, most efficient, and most pythonic way to do this?
Quick sample of data:
dir -time gives a list of 506 elements. Only files ending in .ts have timestamps (and not all of them do). Some files show up in the list but do not have timestamps; some files (such as anything ending in .index) do not show up in the list at all.
ch20prefix_20_182.ts 2014-10-22 16:06:20 - 2014-10-22 16:08:51
ch21prefix_21_40.ts 2014-10-14 16:15:42 - 2014-10-14 16:16:51
modinfo_sdk1.23b24L
bs780_ntplatency
ch10prefix_10_237.ts 2014-10-27 11:05:10 - 2014-10-27 11:07:33
ch10prefix_10_277.ts 2014-10-30 14:03:51 - 2014-10-30 14:04:24
video1_6_1.ts
ch11prefix_11_179.ts 2014-10-22 14:53:50 - 2014-10-22 14:56:00
dir -size gives a list of 967 elements. All files are present here, and all have a file size.
ch10prefix_10_340.index 159544
ch2prefix_2_705.ts 75958204
<ts220> 0
ch11prefix_11_148.ts 19877616
ch10prefix_10_310.ts 7373924
ch11prefix_11_111.index 17112
ch11prefix_11_278.index 1368
ch2prefix_2_307.ts 6492580
channelConfig.xml.2HD 18144
ch21prefix_21_220.ts 12893604
ch20prefix_20_128.index 1720
There is some rhyme and reason to the mess (why some files show up and others don't, and why some have timestamps and others don't), but that is largely irrelevant to this question.
My thoughts on how to approach it:
What I want as final output is a dictionary with each key being a file name, and its value being another dictionary with key/value pairs for Time Created, Time Mod, and fileSize. This way one can easily look up all 3 pieces of information for each file.
The difficult part for me, however, is finding an efficient way of combining the data from the two lists. The first thing that comes to mind would be to loop through the larger list (file size), and then for each element check if it is in the smaller list, and if it is (and has a timestamp), add the data. But that is horridly inefficient. Although I know ahead of time that some files in the larger list have no timestamps in the other list, I cannot say that for all files that lack a timestamp.
The lists are unsorted, but it occurs to me that if they were sorted by file name, that would allow for a much faster way of looking up each file from one list in the other; considering the runtime of sorting the lists, though, it still might not be worth the effort.
So, what would be the most efficient approach here? I am mostly concerned with run-time and readability, but welcome the inclusion of other factors in how I might approach this problem.
It is hard to tell from your question what your desired result is. If you want all files in both lists, even if they only appear in one or the other, just make one pass through both files and create a dictionary using collections.defaultdict:
from collections import defaultdict

d = defaultdict(dict)
with open('fileA.txt') as f:
    for line in f:
        name, time = line[:24], line[24:]
        name, time = name.strip(), time.strip()
        time_created, time_modified = time.split(' - ')
        d[name]['time_created'] = time_created
        d[name]['time_modified'] = time_modified

with open('fileB.txt') as f:
    for line in f:
        name, size = line[:24], line[24:]
        name, size = name.strip(), size.strip()
        d[name]['size'] = size
If your final result should only include files that appear in both lists, then make one pass over each list, constructing separate dictionaries.
dA = defaultdict(dict)
dB = defaultdict(dict)

with open('fileA.txt') as f:
    for line in f:
        name, time = line[:24], line[24:]
        name, time = name.strip(), time.strip()
        try:
            time_created, time_modified = time.split(' - ')
        except ValueError:
            time_created, time_modified = '', ''
        dA[name]['time_created'] = time_created
        dA[name]['time_modified'] = time_modified

with open('fileB.txt') as f:
    for line in f:
        name, size = line[:24], line[24:]
        name, size = name.strip(), size.strip()
        dB[name]['size'] = size
Then make a pass over one of those dictionaries creating a third dictionary with common keys.
d = defaultdict(dict)
for k, v in dA.items():
    if k in dB:
        d[k] = v
        d[k].update(dB[k])
Since this is the only answer (so far) with a solution, and #Brian C didn't offer one, this MUST be the most efficient.
Sounds like a good use case for SQLite.
Python has good support for it. Instead of creating a disk-file-based DB you could use a purely in-memory database by passing the right arguments. First I'd create two tables: tblFileNTimeStamp (file name (PK), timestamp) and tblFileNSize (file name (PK), filesize). Use the output of the two commands to populate the database and use a join on the primary keys to pick the results you need.
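A rough sketch of that idea with Python's built-in sqlite3 module and an in-memory database; the parsing of the two dir outputs is assumed to have already produced tuples, and the sample rows are illustrative only:
import sqlite3

# Assume these were parsed from the `dir -time` and `dir -size` output.
time_rows = [("ch20prefix_20_182.ts", "2014-10-22 16:06:20", "2014-10-22 16:08:51")]
size_rows = [("ch20prefix_20_182.ts", 75958204), ("ch20prefix_20_128.index", 1720)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblFileNTimeStamp (name TEXT PRIMARY KEY, created TEXT, modified TEXT)")
conn.execute("CREATE TABLE tblFileNSize (name TEXT PRIMARY KEY, size INTEGER)")
conn.executemany("INSERT INTO tblFileNTimeStamp VALUES (?, ?, ?)", time_rows)
conn.executemany("INSERT INTO tblFileNSize VALUES (?, ?)", size_rows)

# A LEFT JOIN keeps every file from the size list, with NULL timestamps where missing.
query = """
    SELECT s.name, t.created, t.modified, s.size
    FROM tblFileNSize AS s
    LEFT JOIN tblFileNTimeStamp AS t ON t.name = s.name
"""
result = {name: {"Time Created": created, "Time Mod": modified, "fileSize": size}
          for name, created, modified, size in conn.execute(query)}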

Python extract substring with location of field and symbols

I have been trying to clean a field in a csv file. The field is populated with numbers and characters, which I read into a pandas DataFrame and convert to a string.
The goal is to extract the following variables from the long string: StopId, StopCode (possibly multiple per record), rte, and routeId. Here is what I have attempted so far.
After extracting the variables listed above, I need to merge the variable/codes with another file with location data per each stop/route/rte.
Sample records for the FIELD:
'Web Log: Page generated Query [cid=SM&rte=50183&dir=S&day=5761&dayid=5761&fst=0%2c&tst=0%2c]'
'Web Log: Page generated Query: [_=1407744540393&agencyId=SM&stopCode=361096&rte=7878%7eBus%7e251&dir=W]'
Web Log: Page generated Query: [_=1407744956001&agencyId=AC&stopCode=55451&stopCode=55452stopCode=55489&&rte=43783%7eBus%7e88&dir=S]
My attempts are below, but I am stuck! Advice and recommendations are appreciated.
# Idea 1: Splits the field above in a loop by '&' into a list. This is useful but I'll
# have to write additional code to pull out the relevant variables
i = 0
for t in data['EVENT_DESCRIPTION']:
    s = list(t.split('&'))
    data['STOPS'][i] = [x for x in s if "Web Log" not in x]
    i += 1
# Idea 1 next step help - how to pull out necessary variables from the list in data['STOPS']
# Idea 2: Loop through the field strings to find the start and end of the variable names.
# The output for stopcode_pl (and the other variables) is a tuple or a list of tuples
# (if there is more than one in the string)
for i in data['EVENT_DESCRIPTION']:
    stopcode_pl = [(a.start(), a.end()) for a in list(re.finditer('stopCode=', i))]
    stopid_pl = [(a.start(), a.end()) for a in list(re.finditer('stopId=', i))]
    rte_pl = [(a.start(), a.end()) for a in list(re.finditer('rte=', i))]
    routeid_pl = [(a.start(), a.end()) for a in list(re.finditer('routeId=', i))]
# Idea 2 next step help - how to use the string locations of the variable names to pull out the value of the relevant variable. Is there a trick to grab the characters between the end of the variable name (i.e. after its '=') and the next '&'?
This function
def qdata(rec):
    return [tuple(item.split('=')) for item in rec[rec.find('[')+1:rec.find(']')].split('&')]
yields, for instance, on the first record:
[('cid', 'SM'), ('rte', '50183'), ('dir', 'S'), ('day', '5761'), ('dayid', '5761'), ('fst', '0%2c'), ('tst', '0%2c')]
You can then step across that list searching for your specific items.
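For example, to pull out just the fields the question asks about (stopCode can appear more than once, so it is collected as a list; the record below is a lightly cleaned version of the third sample):
rec = 'Web Log: Page generated Query: [_=1407744956001&agencyId=AC&stopCode=55451&stopCode=55452&rte=43783%7eBus%7e88&dir=S]'
pairs = qdata(rec)
stop_codes = [value for key, value in pairs if key == 'stopCode']
rte = next((value for key, value in pairs if key == 'rte'), None)
print(stop_codes, rte)  # ['55451', '55452'] 43783%7eBus%7e88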

use slice in for loop to build a list

I would like to build up a list using a for loop and am trying to use slice notation. My desired output would be a list with the structure:
known_result[i] = (record.query_id, (align.title, align.title,align.title....))
However I am having trouble getting the slice operator to work:
knowns = "output.xml"
i = 0
for record in NCBIXML.parse(open(knowns)):
    known_results[i] = record.query_id
    known_results[i][1] = (align.title for align in record.alignment)
    i += 1
which results in:
list assignment index out of range.
I am iterating through a series of sequences using BioPython's NCBIXML module but the problem is adding to the list. Does anyone have an idea on how to build up the desired list either by changing the use of the slice or through another method?
thanks zach cp
(crossposted at Biostar)
You cannot assign a value to a list at an index that doesn't exist. The way to add an element (at the end of the list, which is the common use case) is to use the .append method of the list.
In your case, the lines
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
Should probably be changed to
element = (record.query_id, tuple(align.title for align in record.alignment))
known_results.append(element)
Warning: The code above is untested, so might contain bugs. But the idea behind it should work.
Use:
known_results = []
i = 0
for record in NCBIXML.parse(open(knowns)):
    known_results.append([record.query_id, None])
    known_results[i][1] = (align.title for align in record.alignment)
    i += 1
If I understand you right, you want to assign to every record.query_id one or more matching align.title. So I guess your query_ids are unique and those unique ids are related to some titles. If so, I would suggest a dictionary instead of a list.
A dictionary consists of a key (e.g. record.query_id) and value(s) (e.g. a list of align.title).
catalog = {}
for record in NCBIXML.parse(open(knowns)):
    catalog[record.query_id] = [align.title for align in record.alignment]
To access this catalog you could either iterate through it:
for query_id in catalog:
    print catalog[query_id]  # returns the title list for the current key
or you could access entries directly if you know what you're looking for:
query_id = XYZ_Whatever
print catalog[query_id]
