Parsing a pretty printed table into Python objects [closed] - python

Suppose there is a string called likes_and_dislikes, formatted visually as the table shown below.
How can I parse the string and return a list of (like, dislike) tuples? The top header row (likes, dislikes) should not appear in the list of tuples.
likes_and_dislikes="""
+------------------------------------+-----------------------------------+
| likes | dislikes |
+------------------------------------+-----------------------------------+
| Meritocracy | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration | Ego-driven rhetoric, drama and FUD|
| | to get one's way |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure |
| capable of attracting top-tier | managers compensating for a weak, |
| talent | immature team |
+------------------------------------+-----------------------------------+ """

The key here is to examine the table thoroughly and understand what you are trying to pull out.
First of all, parsing strings like this is generally easier when done line-by-line, so you need to split on the table's row separators first and then parse the columns within each row. We do this primarily because individual likes and dislikes span multiple lines.
1. Getting each row
We don't know how wide the table might be so we use regular expressions to break up our table like so:
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # Drop the header and the tail
This gives us an array whose elements correspond to our multiline rows. The slice at the end drops everything up to and including the header row, plus the trailing chunk after the final border. However, we still have the problem of pulling together strings that span multiple lines within a cell.
2. Finding a like and dislike
If we iterate through this array of rows, we know each row has a like and a dislike, each spanning an unknown number of lines. We initialise the like and the dislike each as a list to make concatenation easier at the end.
for p in pairs:
    like, dislike = [], []
3. Dealing with each line
With our row, we need to split it based on newlines, then split based on the pipes (|).
    for l in p.split('\n'):
        pair = l.split('|')
4. Pulling out each like and dislike
If the pair we are given has more than one element, then there must be a like/dislike pair for us to capture. So append the pieces to our like and dislike lists, not to likes or dislikes, as those hold our final formatted strings. We also strip() each piece to remove any leading or trailing whitespace.
        if len(pair) > 1:
            # Not a blank line
            like.append(pair[1].strip())
            dislike.append(pair[2].strip())
5. Creating the final text
Once we are done processing the row we can join the strings with a single space, and can finally add these to our likes and dislikes array.
    if len(like) > 0:
        likes.append(" ".join(like))
    if len(dislike) > 0:
        dislikes.append(" ".join(dislike))
6. Using our new data structure
Now we can use these two new lists any way we choose, either printing each list separately...
from pprint import pprint
print "Likes:"
pprint(likes,indent=4)
print "Dislikes:"
pprint(dislikes,indent=4)
... or zip() them together to create a list of paired likes and dislikes!
print "A set of paired likes and dislikes"
pprint(zip(likes,dislikes),indent=4)
The complete code:
likes_and_dislikes="""
+------------------------------------+-----------------------------------+
| likes | dislikes |
+------------------------------------+-----------------------------------+
| Meritocracy | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration | Ego-driven rhetoric, drama and FUD|
| | to get one's way |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure |
| capable of attracting top-tier | managers compensating for a weak, |
| talent | immature team |
+------------------------------------+-----------------------------------+ """
import re

likes, dislikes = [], []
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # Drop the header and the tail
for p in pairs:
    like, dislike = [], []
    for l in p.split('\n'):
        pair = l.split('|')
        if len(pair) > 1:
            # Not a blank line
            like.append(pair[1].strip())
            dislike.append(pair[2].strip())
    if len(like) > 0:
        likes.append(" ".join(like))
    if len(dislike) > 0:
        dislikes.append(" ".join(dislike))

from pprint import pprint
print "Likes:"
pprint(likes, indent=4)
print "Dislikes:"
pprint(dislikes, indent=4)
print "A set of paired likes and dislikes"
pprint(zip(likes, dislikes), indent=4)
This results in:
Likes:
[ 'Meritocracy',
'Healthy debates and collaboration ',
'Autonomy given by confident leaders capable of attracting top-tier talent']
Dislikes:
[ 'Favoritism, ass-kissing, politics',
"Ego-driven rhetoric, drama and FUD to get one's way",
'Micro-management by insecure managers compensating for a weak, immature team']
A set of paired likes and dislikes
[ ('Meritocracy', 'Favoritism, ass-kissing, politics'),
( 'Healthy debates and collaboration ',
"Ego-driven rhetoric, drama and FUD to get one's way"),
( 'Autonomy given by confident leaders capable of attracting top-tier talent',
'Micro-management by insecure managers compensating for a weak, immature team')]
You can see the complete code in action on codepad.
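If you are on Python 3, the same approach works with only cosmetic changes; a minimal sketch of the adjusted tail (hedged: only the printing differs, since print is a function and zip() is lazy there):

from pprint import pprint
print("Likes:")
pprint(likes, indent=4)
print("Dislikes:")
pprint(dislikes, indent=4)
print("A set of paired likes and dislikes")
pprint(list(zip(likes, dislikes)), indent=4)  # zip() returns an iterator in Python 3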

That's (one of) the table formats used in reStructuredText (reST, a pythonic form of markup), and there are various parsers kicking around for it.
Here's one, on the old python.org site: http://legacy.python.org/scripts/ht2html/docutils/parsers/rst/tableparser.py
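For reference, a minimal sketch of driving that parser (hedged: this assumes the string is trimmed to exactly the grid's own lines, since the grid parser is strict about its input, and that docutils is installed):

from docutils.parsers.rst.tableparser import GridTableParser
from docutils.statemachine import StringList

# Trim to just the table's lines; no leading blank line or trailing spaces.
lines = [l.rstrip() for l in likes_and_dislikes.strip().splitlines()]
colspecs, head_rows, body_rows = GridTableParser().parse(StringList(lines))

# With '-' separators only (no '=' header rule), every row lands in
# body_rows; each cell is a (morerows, morecols, offset, lines) tuple.
rows = [tuple(" ".join(s.strip() for s in cell[3]).strip() for cell in row if cell)
        for row in body_rows]
pairs = rows[1:]  # drop the (likes, dislikes) header row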

Related

Trying to place strings into columns

There are 3 columns, levels 1-3. A file is read, and each line of the file contains various data, including the level to which it belongs, at the end of the string.
Sample lines from file being read:
thing_1 - level 1
thing_17 - level 3
thing_22 - level 2
I want to assign each "thing" to its corresponding column. I have looked into pandas, but DataFrame columns don't seem to work here: the data passed in would need one attribute per column, whereas each of my lines carries only a single data point.
How could I approach this problem?
Desired output:
level 1    level 2     level 3
thing_1    thing_22    thing_17
Edit:
In looking at the suggestions, I can refine my question further. I have up to 3 columns, and each line from the file needs to be assigned to one of the 3 columns. Most solutions seem to need something like:
data = [['Mary', 20], ['John', 57]]
columns = ['Name', 'Age']
This does not work for me, since there are 3 columns, and each piece of data goes into only one.
There's an additional wrinkle here that I didn't notice at first. If each of your levels has the same number of things, then you can build a dictionary and then use it to supply the table's columns to PrettyTable:
from prettytable import PrettyTable

# Create an empty dictionary.
levels = {}
with open('data.txt') as f:
    for line in f:
        # Remove trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # If this is a new level, set it to a list containing its thing.
        if level not in levels:
            levels[level] = [thing]
        # Otherwise, add the new thing to the level's list.
        else:
            levels[level].append(thing)

# Create the table, and add each level as a column.
table = PrettyTable()
for level, things in levels.items():
    table.add_column(level, things)
print(table)
For the example data you showed, this prints:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
+---------+----------+----------+
The Complication
I probably wouldn't have posted an answer (believing it was covered sufficiently in this answer), except that I realized there's an unintuitive hurdle here. If your levels contain different numbers of things each, you get an error like this:
Exception: Column length 2 does not match number of rows 1!
Because none of the readily available solutions has an obvious, "automatic" answer to this, here is a simple way to do it. Build the dictionary as before, then:
# Find the length of the longest list of things.
longest = max(len(things) for things in levels.values())

table = PrettyTable()
for level, things in levels.items():
    # Pad out the list if it's shorter than the longest.
    things += ['-'] * (longest - len(things))
    table.add_column(level, things)
print(table)
This will print something like this:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
| - | - | thing_5 |
+---------+----------+----------+
Extra
If all of that made sense and you'd like to know about a way part of it can be streamlined a little, take a look at Python's defaultdict. It can take care of the "check if this key already exists" process, providing a default (in this case a new list) if nothing's already there.
from collections import defaultdict

levels = defaultdict(list)
with open('data.txt') as f:
    for line in f:
        # Remove trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # Automatically handles adding a new key if needed:
        levels[level].append(thing)
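And if you combine that with itertools.zip_longest, the manual padding from the complication above goes away too; a sketch (one way among several, with the '-' fillvalue mirroring the padding used earlier):

from itertools import zip_longest
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = list(levels)
# zip_longest pads the shorter columns with fillvalue for us.
for row in zip_longest(*levels.values(), fillvalue='-'):
    table.add_row(row)
print(table)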

Efficient query on a sorted csv

I have a .csv with several million rows. The first column is the id of each entry, and each id only occurs one time. The first column is sorted. Intuitively I'd say that it might be pretty easy to query this file efficiently using a divide and conquer algorithm. However, I couldn't find anything related to this.
Sample .csv file:
+----+------------------+-----+
| id | name             | age |
+----+------------------+-----+
| 1  | John Cleese      | 34  |
+----+------------------+-----+
| 3  | Mary Poppins     | 35  |
+----+------------------+-----+
| .. | ...              | ..  |
+----+------------------+-----+
| 87 | Barry Zuckerkorn | 45  |
+----+------------------+-----+
I don't want to load the file in memory (too big), and I prefer to not use databases. I know I can just import this file in sqlite, but then I have multiple copies of this data, and I'd prefer to avoid that for multiple reasons.
Is there a good package I'm overlooking? Or is it something that I'd have to write myself?
Ok, my understanding is that you want some of the functionality of a light database, but are constrained to use a csv text file to hold the data. IMHO, this is probably a questionable design: past several hundred rows, I would only see a csv file as an intermediate or exchange format.
As it is a very uncommon design, it is unlikely that a package for it already exists; for my part I know of none. So I can imagine two possible approaches. The first: scan the file once and build an index of id -> row_position, then use that index for your queries. Depending on the actual length of your rows, you could index only every n-th row to trade speed for memory. But it costs you an index file.
An alternative is a direct divide-and-conquer algorithm: use stat/fstat to get the file size, and search for the next end of line starting from the middle of the file. The id immediately follows it. If that id is the one you want, fine, you have won; if it is greater, recurse into the upper part; if it is lesser, recurse into the lower part. But because of the need to search for ends of lines, be prepared for corner cases like never finding an end of line in the expected range, or finding it at the very end.
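A sketch of the first (index) approach, hedged: build_index and query_by_id are made-up names, and it assumes integer ids in the first column:

import os

def build_index(fname, sep=','):
    # One linear pass: map each id to the byte offset where its row starts.
    index = {}
    with open(fname) as f:
        offset = f.tell()
        line = f.readline()
        while line:
            try:
                index[int(line.split(sep)[0])] = offset
            except ValueError:
                pass  # e.g. a header line
            offset = f.tell()
            line = f.readline()
    return index

def query_by_id(fname, index, id):
    # Seek straight to the stored offset and read the row.
    with open(fname) as f:
        f.seek(index[id])
        return f.readline()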
After Serge's answer I decided to write my own implementation; here it is. It doesn't allow newlines within fields and doesn't deal with a lot of details of the .csv format. It assumes that the .csv is sorted on the first column and that the first column holds integer values.
import os

def query_sorted_csv(fname, id):
    filesize = os.path.getsize(fname)
    with open(fname) as fin:
        row = look_for_id_at_location(fin, 0, filesize, id)
        if not row:
            raise Exception('id not found!')
        return row

def look_for_id_at_location(fin, location_lower, location_upper, id, sep=',', id_column=0):
    location = int((location_upper + location_lower) / 2)
    if location_upper - location_lower < 2:
        return False
    fin.seek(location)
    next(fin)
    try:
        full_line = next(fin)
    except StopIteration:
        return False
    id_at_location = int(full_line.split(sep)[id_column])
    if id_at_location == id:
        return full_line
    if id_at_location > id:
        return look_for_id_at_location(fin, location_lower, location, id)
    else:
        return look_for_id_at_location(fin, location, location_upper, id)

row = query_sorted_csv('data.csv', 505)
You can look up about 4000 ids per second in a 2 million row 250MB .csv file. In comparison, you can look up 3 ids per second whilst looping over the entire file line by line.

python how to nicely align long text in pandas dataframe?

Given a pandas dataframe (retrieved from a database), I'm trying to output the result to the console in such a way that it is complete and readable.
The challenge I have is with the long text in 2 columns: LPQ_REASON and LPQ_RESOLUTION. You will note from the output below (print df) that both LPQ columns end with 3 dots (...), so I can't read the text. This happens despite my initial settings of:
pd.set_option('display.max_rows', 1500)
pd.set_option('display.max_columns', 1500)
pd.set_option('display.width', 1000)
so the result on the console looks like this:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (h... This Memo was issued with conjunction to our j... 3979
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% (h... This Memo was issued on matching viloation for... 3979
The optimal solution I'm looking for (if doable) is to print the entire line such that:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (here This Memo was issued with conjunction to our 3979
comes the rest of the sentence. it might be analysis to foster a better bs when writing
long, or not, it might be short or whatever)
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% yes and This Memo was issued on matching viloation for 3979
here I'm going to write the entire sentence who cares on what violation. just issued.
as if I really remember what was written. ha
Not as optimal as you want, but you can try the following:
pd.set_option('display.max_colwidth',100)
where 100 is the column width you can choose. But this will not create a multi-line cell, just a very long column.
Or, not very elegant, but you can try the 'tabulate' library (https://pypi.python.org/pypi/tabulate), which creates nice text tables like:
+--------+-------+
| item | qty |
+========+=======+
| spam | 42 |
+--------+-------+
| eggs | 451 |
+--------+-------+
| bacon | 0 |
+--------+-------+
With tabulate you can use the '\n' newline character: just iterate over your text cells and insert a '\n' every X characters (let's say every 50 characters).
a simple code for that:
for i in range(len(data)):
    data.at[i, 'text'] = data.at[i, 'text'][0:50] + '\n' + data.at[i, 'text'][50:]
The above is limited to a single line break, but you can improve it to make multiple breaks for long text (see the sketch below); it also doesn't take into consideration whether the break lands in the middle of a word.
Make sure to do this on a copy of the data, because it changes your data! And if you then print it with a regular 'print' you will see the '\n' stuck in the middle of the text.
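A minimal sketch of the multi-break version, assuming df is the dataframe from the question and the two LPQ columns hold plain strings (textwrap.fill breaks at word boundaries, unlike the slicing above):

import textwrap
from tabulate import tabulate

wrapped = df.copy()  # work on a copy so the original data stays intact
for col in ('LPQ_REASON', 'LPQ_RESOLUTION'):
    # Insert '\n' roughly every 50 characters, at word boundaries.
    wrapped[col] = wrapped[col].apply(lambda s: textwrap.fill(s, width=50))

# The 'grid' format renders the line breaks inside each cell.
print(tabulate(wrapped, headers='keys', tablefmt='grid'))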

How to run pyspark code in distributed environment

I have 1 million records and I want to try Spark for this. I have a list of items and want to perform a lookup in the records using the items of this list.
l = ['domestic',"private"]
text = ["On the domestic front, growth seems to have stalled, private investment and credit off-take is feeble, inflation seems to be bottoming out and turning upward, current account situation is not looking too promising, FPI inflows into debt and equity have slowed, and fiscal deficit situation of states is grim.", "Despite the aforementioned factors, rupee continues to remain strong against the USD and equities continue to outperform.", "This raises the question as to whether the asset prices are diverging from fundamentals and if so when are they expected to fall in line. We examine each of the above factors in a little more detail below.Q1FY18 growth numbers were disappointing with the GVA, or the gross value added, coming in at 5.6 percent. Market participants would be keen to ascertain whether the disappointing growth in Q1 was due to transitory factors such as demonetisation and GST or whether there are structural factors at play. There are silver linings such as a rise in core GVA (GVA excluding agri and public services), a rise in July IIP (at 1.2%), pickup in activity in the cash-intensive sectors, pick up in rail freight and containers handled by ports.However, there is a second school of thought as well, which suggests that growth slowdown could be structural. With demonetisation and rollout of GST, a number of informal industries have now been forced to enter the formal setup."]
res = {}
for rec in text:
    for word in l:
        if word in rec:
            res[rec] = 1
            break
print res
This is a simple Python script, and I want to execute the same logic using pyspark (will this same code work?) in a distributed manner to reduce the execution time.
Can you please guide me on how to do this? I am sorry, as I am very new to Spark; your help will be much appreciated.
After instantiating a spark context and/or a spark session, you'll have to convert your list of records to a dataframe:
df = spark.createDataFrame(
    sc.parallelize([[rec] for rec in text]),
    ["text"]
)
df.show()
df.show()
+--------------------+
| text|
+--------------------+
|On the domestic f...|
|Despite the afore...|
|This raises the q...|
+--------------------+
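(As a side note, the RDD detour isn't strictly needed; a sketch of the shorter form, assuming a SparkSession named spark:)

df = spark.createDataFrame([(rec,) for rec in text], ["text"])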
Now you can check for each line if words in l are present or not:
sc.broadcast(l)
res = df.withColumn("res", df.text.rlike('|'.join(l)).cast("int"))
res.show()
+--------------------+---+
| text|res|
+--------------------+---+
|On the domestic f...| 1|
|Despite the afore...| 0|
|This raises the q...| 0|
+--------------------+---+
rlike is for performing regex matching
sc.broadcast is for copying object l to every node so they don't have to fetch it from the driver
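Note that the broadcast handle only pays off if you actually reference it on the executors; the rlike expression above builds its pattern on the driver. A minimal sketch of using the broadcast value explicitly, via a UDF (hedged: contains_any is a made-up name, and this assumes the df built above):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

l_bc = sc.broadcast(l)

@udf(IntegerType())
def contains_any(text):
    # Runs on the executors; each worker fetches the broadcast value once.
    return int(any(word in text for word in l_bc.value))

res = df.withColumn("res", contains_any(df.text))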
Hope this helps

Merge two lists of objects containing lists

I have a directory tree containing html files called slides. Something like:
slides_root
|
|_slide-1
| |_slide-1.html
| |_slide-2.html
|
|_slide-2
| |
| |_slide-1
| | |_slide-1.html
| | |_slide-2.html
| | |_slide-3.html
| |
| |_slide-2
| |_slide-1.html
...and so on. They could go even deeper. Now imagine I have to replace some slides in this structure by merging it with another tree which is a subset of this.
WITH AN EXAMPLE: say that I want to replace slide-1.html and slide-3.html inside "slides_root/slide-2/slide-1" merging "slides_root" with:
slide_to_change
|
|_slide-2
|
|_slide-1
|_slide-1.html
|_slide-3.html
I would merge "slide_to_change" into "slides_root". The structure is the same, so everything goes fine. But I have to do it on a Python object representation of this scheme.
So the two trees are represented by two instances - slides1, slides2 - of the same "Slide" class which is structured as follows:
class Slide(object):
    def __init__(self, path):
        self.path = path
        self.slides = []  # nested Slide objects
Both slide1 and slide2 contain a path and a list of other Slide objects, each with its own path and list of Slide objects, and so on.
The rule is that if the relative path is the same, then I replace the slide object in slide1 with the one in slide2.
How can I achieve this result? It is really difficult and I can see no way out. Ideally something like:
for slide_root in slide1.slides:
    for slide_dest in slide2.slides:
        if slide_root.path == slide_dest.path:
            slide_root = slide_dest
            # now restart the loop at a deeper level
            # repeat
Thanks everyone for any answer.
It's not so complicated: just use a recursive function to walk the to-be-inserted tree while keeping hold of the corresponding place in the old tree.
If the parts match:
    If the parts are both leaves (html thingies):
        Insert (overwrite) the value.
    If the parts are both nodes (slides):
        Call yourself with the sub-slides (here's the recursion).
I know this is just a hint, a sketch of how to do it, but maybe you want to start from this. In Python it could look something like this (also not completely fleshed out):
def merge_slide(slide, old_slide):
    for sub_slide in slide.slides:
        sub_slide_position_in_old_slide = find_sub_slide_position_by_path(old_slide, sub_slide.path)
        if sub_slide_position_in_old_slide >= 0:  # we found a match!
            sub_slide_in_old_slide = old_slide.slides[sub_slide_position_in_old_slide]
            if sub_slide.slides:  # this is a node!
                merge_slide(sub_slide, sub_slide_in_old_slide)  # here we recurse
            else:  # this is a leaf! so we replace it:
                old_slide.slides[sub_slide_position_in_old_slide] = sub_slide
        else:  # nothing like this in old_slide
            pass  # ignore (you might want to consider this case!)
Maybe that gives you an idea on how I would approach this.
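For completeness, find_sub_slide_position_by_path isn't defined above; a minimal sketch of what it could look like (hedged: it takes the old slide explicitly, matching the call in the sketch):

def find_sub_slide_position_by_path(old_slide, path):
    # Return the index of the sub-slide whose path matches, or -1 if absent.
    for position, candidate in enumerate(old_slide.slides):
        if candidate.path == path:
            return position
    return -1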
