Python: how to nicely align long text in a pandas DataFrame?

Given a pandas DataFrame (received from a database), I'm trying to output the result to the console in such a way that it is complete and readable.
The challenge I have is with the long text in two columns: LPQ_REASON and LPQ_RESOLUTION. You will note from the output below (print df) that both LPQ columns are truncated with three dots (...), so I can't read the text. This happens despite my initial settings of:
pd.set_option('display.max_rows', 1500)
pd.set_option('display.max_columns', 1500)
pd.set_option('display.width', 1000)
so the result on the console looks like this:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (h... This Memo was issued with conjunction to our j... 3979
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% (h... This Memo was issued on matching viloation for... 3979
An optimal solution I'm looking for (if doable) is to print the entire line such that:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (here This Memo was issued with conjunction to our 3979
comes the rest of the sentence. it might be analysis to foster a better bs when writing
long, or not, it might be short or whatever)
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% yes and This Memo was issued on matching viloation for 3979
here I'm going to write the entire sentence who cares on what violation. just issued.
as if I really remember what was written. ha

Not as optimal as you want, but you can try the following:
pd.set_option('display.max_colwidth', 100)
where 100 is the column width you choose. But this will not create a multi-line cell, just a very long column.
Or, less elegantly, you can try the 'tabulate' library (https://pypi.python.org/pypi/tabulate), which creates nice text tables like:
+--------+-------+
| item | qty |
+========+=======+
| spam | 42 |
+--------+-------+
| eggs | 451 |
+--------+-------+
| bacon | 0 |
+--------+-------+
With tabulate you can use the '\n' new-line character: just iterate over your text cells and insert an '\n' every X characters (let's say every 50 characters).
A simple code for that:
for i in range(len(data)):
    data.at[i, 'text'] = data.at[i, 'text'][0:50] + '\n' + data.at[i, 'text'][50:]
The above is limited to a single line break, but you can improve it to make multiple breaks for long text; it also doesn't take into consideration whether it breaks in the middle of a word. A sketch that handles both follows below.
Make sure to do this on a copy of the data, because it changes your data; and if you then print it with a regular 'print', you will see the '\n' stuck in the middle of the text.
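As a reference, here is a minimal sketch using Python's standard textwrap module, which makes multiple breaks and avoids splitting mid-word; the column name 'text' and the width of 50 are just carried over from the example above:
import textwrap
from tabulate import tabulate

wrapped = data.copy()  # work on a copy, as warned above
wrapped['text'] = wrapped['text'].apply(
    lambda s: '\n'.join(textwrap.wrap(s, width=50)))
print(tabulate(wrapped, headers='keys', tablefmt='grid'))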


Trying to place strings into columns

There are 3 columns, levels 1-3. A file is read, and each line of the file contains various data, including the level to which it belongs, located at the end of the string.
Sample lines from file being read:
thing_1 - level 1
thing_17 - level 3
thing_22 - level 2
I want to assign each "thing" to its corresponding column. I have looked into pandas, but it would seem that DataFrame columns won't work, as the passed data would need to have attributes matching the number of columns; in my case I need 3 columns, but each piece of data has only 1 data point.
How could I approach this problem?
Desired output:
level 1 level 2 level 3
thing_1 thing_22 thing_17
Edit:
In looking at the suggestions, I can refine my question further. I have up to 3 columns, and each line from the file needs to be assigned to one of the 3 columns. Most solutions seem to need something like:
data = [['Mary', 20], ['John', 57]]
columns = ['Name', 'Age']
This does not work for me, since there are 3 columns, and each piece of data goes into only one.
There's an additional wrinkle here that I didn't notice at first. If each of your levels has the same number of things, then you can build a dictionary and use it to supply the table's columns to PrettyTable:
from prettytable import PrettyTable

# Create an empty dictionary.
levels = {}
with open('data.txt') as f:
    for line in f:
        # Remove trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # If this is a new level, set it to a list containing its thing.
        if level not in levels:
            levels[level] = [thing]
        # Otherwise, add the new thing to the level's list.
        else:
            levels[level].append(thing)

# Create the table, and add each level as a column.
table = PrettyTable()
for level, things in levels.items():
    table.add_column(level, things)
print(table)
For the example data you showed, this prints:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
+---------+----------+----------+
The Complication
I probably wouldn't have posted an answer (believing it was covered sufficiently in this answer), except that I realized there's an unintuitive hurdle here. If your levels contain different numbers of things each, you get an error like this:
Exception: Column length 2 does not match number of rows 1!
Because none of the readily available solutions handles this automatically, here is a simple way to do it. Build the dictionary as before, then:
# Find the length of the longest list of things.
longest = max(len(things) for things in levels.values())

table = PrettyTable()
for level, things in levels.items():
    # Pad out the list if it's shorter than the longest.
    things += ['-'] * (longest - len(things))
    table.add_column(level, things)
print(table)
This will print something like this:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
| - | - | thing_5 |
+---------+----------+----------+
Extra
If all of that made sense and you'd like to know about a way part of it can be streamlined a little, take a look at Python's defaultdict. It can take care of the "check if this key already exists" process, providing a default (in this case a new list) if nothing's already there.
from collections import defaultdict
levels = defaultdict(list)
with open('data.txt') as f:
for line in f:
# Remove trailing \n and split into the parts we want.
thing, level = line.rstrip('\n').split(' - ')
# Automatically handles adding a new key if needed:
levels[level].append(thing)
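As an aside, since the question mentions pandas: the same levels dictionary can also feed a pandas DataFrame despite the uneven column lengths, because building the frame row-wise and transposing pads the short columns with NaN. A small sketch under that assumption:
import pandas as pd

# Keys become rows, then transpose so levels are columns;
# shorter columns are padded with NaN automatically.
df = pd.DataFrame.from_dict(levels, orient='index').T
print(df)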

Efficient query on a sorted csv

I have a .csv with several million rows. The first column is the id of each entry, and each id only occurs one time. The first column is sorted. Intuitively I'd say that it might be pretty easy to query this file efficiently using a divide and conquer algorithm. However, I couldn't find anything related to this.
Sample .csv file:
+----+------------------+-----+
| id | name | age |
+----+------------------+-----+
| 1 | John Cleese | 34 |
+----+------------------+-----+
| 3 | Mary Poppins | 35 |
+----+------------------+-----+
| .. | ... | .. |
+----+------------------+-----+
| 87 | Barry Zuckerkorn | 45 |
+----+------------------+-----+
I don't want to load the file in memory (too big), and I prefer not to use databases. I know I can just import this file into sqlite, but then I have multiple copies of this data, and I'd prefer to avoid that for multiple reasons.
Is there a good package I'm overlooking? Or is it something that I'd have to write myself?
OK, my understanding is that you want some of the functionality of a light database, but are constrained to use a csv text file to hold the data. IMHO, this is probably a questionable design: past several hundred rows, I would only see a csv file as an intermediate or exchange format.
As it is a very uncommon design, it is unlikely that a package for it already exists - for my part, I know of none. So I would imagine 2 possible ways: scan the file once and build an index id -> row_position, then use that index for your queries. Depending on the actual length of your rows, you could index only every n-th row to trade some query speed for memory. But it costs an index file.
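A minimal sketch of that index idea, assuming integer ids in the first column and a one-line header (the function names are illustrative):
def build_index(fname, sep=b',', skip_header=True):
    # Scan the file once, mapping each id to its byte offset.
    index = {}
    offset = 0
    with open(fname, 'rb') as f:
        for line in f:
            if skip_header:
                skip_header = False
            else:
                index[int(line.split(sep)[0])] = offset
            offset += len(line)
    return index

def query_by_index(fname, index, id):
    # Seek straight to the indexed row and read it back.
    with open(fname, 'rb') as f:
        f.seek(index[id])
        return f.readline().decode()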
An alternative way would be a direct divide-and-conquer algorithm: use stat/fstat to get the file size, and search for the next end of line starting from the middle of the file. Immediately after it you get an id. If the id you want is that one, fine, you have won; if it is greater, recurse into the upper part; if it is less, recurse into the lower part. But because of the need to search for ends of lines, be prepared for corner cases like never finding an end of line in the expected range, or finding it at the very end.
After Serge's answer I decided to write my own implementation; here it is. It doesn't allow newlines inside fields and doesn't deal with a lot of details regarding the .csv format. It assumes that the .csv is sorted on the first column, and that the first column holds integer values.
import os

def query_sorted_csv(fname, id):
    filesize = os.path.getsize(fname)
    with open(fname) as fin:
        row = look_for_id_at_location(fin, 0, filesize, id)
        if not row:
            raise Exception('id not found!')
        return row

def look_for_id_at_location(fin, location_lower, location_upper, id, sep=',', id_column=0):
    location = int((location_upper + location_lower) / 2)
    if location_upper - location_lower < 2:
        return False
    fin.seek(location)
    # We probably landed mid-line, so skip to the start of the next full line.
    next(fin)
    try:
        full_line = next(fin)
    except StopIteration:
        return False
    id_at_location = int(full_line.split(sep)[id_column])
    if id_at_location == id:
        return full_line
    if id_at_location > id:
        return look_for_id_at_location(fin, location_lower, location, id)
    else:
        return look_for_id_at_location(fin, location, location_upper, id)

row = query_sorted_csv('data.csv', 505)
You can look up about 4000 ids per second in a 2-million-row, 250 MB .csv file. In comparison, you can look up about 3 ids per second whilst looping over the entire file line by line.

How to run pyspark code in distributed environment

I have 1 million records and I want to try Spark for this. I have a list of items and want to look up each record against the items of this list.
l = ['domestic',"private"]
text = ["On the domestic front, growth seems to have stalled, private investment and credit off-take is feeble, inflation seems to be bottoming out and turning upward, current account situation is not looking too promising, FPI inflows into debt and equity have slowed, and fiscal deficit situation of states is grim.", "Despite the aforementioned factors, rupee continues to remain strong against the USD and equities continue to outperform.", "This raises the question as to whether the asset prices are diverging from fundamentals and if so when are they expected to fall in line. We examine each of the above factors in a little more detail below.Q1FY18 growth numbers were disappointing with the GVA, or the gross value added, coming in at 5.6 percent. Market participants would be keen to ascertain whether the disappointing growth in Q1 was due to transitory factors such as demonetisation and GST or whether there are structural factors at play. There are silver linings such as a rise in core GVA (GVA excluding agri and public services), a rise in July IIP (at 1.2%), pickup in activity in the cash-intensive sectors, pick up in rail freight and containers handled by ports.However, there is a second school of thought as well, which suggests that growth slowdown could be structural. With demonetisation and rollout of GST, a number of informal industries have now been forced to enter the formal setup."]
res = {}
for rec in text:
    for word in l:
        if word in rec:
            res[rec] = 1
            break
print res
This is a simple Python script, and I want to execute the same logic using pyspark (will this same code work?) in a distributed manner to reduce the execution time.
Can you please guide me on how to do this? I am sorry, as I am very new to Spark; your help will be much appreciated.
After instantiating a Spark context and/or a Spark session, you'll have to convert your list of records to a dataframe:
df = spark.createDataFrame(
    sc.parallelize([[rec] for rec in text]),
    ["text"]
)
df.show()
df.show()
+--------------------+
| text|
+--------------------+
|On the domestic f...|
|Despite the afore...|
|This raises the q...|
+--------------------+
Now you can check, for each line, whether any of the words in l are present:
sc.broadcast(l)
res = df.withColumn("res", df.text.rlike('|'.join(l)).cast("int"))
res.show()
+--------------------+---+
| text|res|
+--------------------+---+
|On the domestic f...| 1|
|Despite the afore...| 0|
|This raises the q...| 0|
+--------------------+---+
rlike performs regex matching, so '|'.join(l) matches any of the words in l.
sc.broadcast is meant to copy an object to every node so the executors don't have to fetch it from the driver; note, though, that the regex pattern here is built on the driver anyway, so the broadcast isn't strictly needed for this particular expression.
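If you do want the executors to work against the broadcast list directly (for example with plain substring checks, as in the original loop), here is a hedged sketch using a UDF; contains_any is an illustrative name:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

l_bc = sc.broadcast(l)  # each executor gets its own read-only copy

# Plain substring check against the broadcast list.
contains_any = udf(lambda text: int(any(w in text for w in l_bc.value)),
                   IntegerType())
res = df.withColumn("res", contains_any(df.text))
res.show()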
Hope this helps

How to convert a value to HHmm format when it has no leading zeros in a Pig Latin script

I am trying to find the difference between two time fields in a Pig relation. I can use Pig's ToDate() method, but for that the value should be in HHmm format; however, mine has no leading zeros. For example, if the two fields had values 1245 and 1425, I can find the difference by converting them with ToDate. However, if the values are 945 and 823, I cannot convert using ToDate, because there is no leading zero.
However I wrote a python udf attempting to leftpad a zero. Please find the code below
#outputSchema("time:bytearray")
def zero(time):
time = str(time)
if len(time)<= 3:
return '0'+ time
else:
return time
Step 1: Register my Python function
REGISTER '/home/Jig13517/zeropad.py' using jython AS myfuncs ;
Please find the relation below
Airlines_data_schema = LOAD '/user/Jig13517/pigsample/Airlines_data.csv' USING PigStorage('\t') AS (Year,Month,DayofMonth,DayofWeek,DepTime_actual,CRSDeptime,Arrtime_actual,CRSArrtime,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
=====================================
Then I tried to leftpad the column value with zeros
airlines_new = FOREACH Airlines_data_schema GENERATE Year,Month,DayofMonth,DayofWeek,myfuncs.zero($4) AS DepTime_actual_new,myfuncs.zero($5) AS CRSDeptime_new,myfuncs.zero($6) AS Arrtime_actual_new,myfuncs.zero($7) AS CRSArrtime_new,UniqueCarrier,FlightNum,TailNum_Plane,ActualElapsedTime,CRSElapsedTime,Airtime,Arrdelay,Depdelay,Origin,Dest,Distance,Taxiin,Taxiout,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay ;
===============================
Sample data after applying the Python UDF:
(2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA,,,,None,None,None,None,,,,,,,,,,,,,,,,,,,,,)
But we can see above that it is not converting the column values; I am getting the same fields unaltered. Please let me know what is wrong with my UDF, or whether there is any Pig method to achieve this task.
The str.zfill function could help
input.txt
1245
1425
945
823
pig_udfs.py
@outputSchema('time:chararray')
def lpad_time(time):
    return time.zfill(4)
time_formatter.pig
register pig_udfs.py using jython as myfuncs;
A = LOAD 'input.txt' USING PigStorage();
B = FOREACH A GENERATE myfuncs.lpad_time((chararray) $0);
\d B
Output
(1245)
(1425)
(0945)
(0823)
Obviously, you could make Python do the entire ToDate work itself... a sketch of that idea follows below.
Also, it wasn't clear from your question whether the minutes were zero-padded.
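For instance, a hedged sketch of a Jython UDF that pads both times and returns the difference in minutes directly; time_diff is an illustrative name, and it assumes both HHmm values fall within the same day:
@outputSchema('diff_minutes:int')
def time_diff(dep, arr):
    # Pad to HHmm, then convert each to minutes since midnight.
    dep, arr = str(dep).zfill(4), str(arr).zfill(4)
    dep_min = int(dep[:2]) * 60 + int(dep[2:])
    arr_min = int(arr[:2]) * 60 + int(arr[2:])
    return arr_min - dep_min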
EDIT
airlines.csv
2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA,,,,None,None,None,None,,,,,,,,,,,,,,,,,,,,,
pig code
register pig_udfs.py using jython as myfuncs;
A = LOAD 'airlines.csv' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS Year, $1 AS Month, $2 AS DayofMonth, $3 AS DayofWeek,myfuncs.lpad_time((chararray) $4) AS DepTime_actual_new,myfuncs.lpad_time((chararray) $5) AS CRSDeptime_new,myfuncs.lpad_time((chararray) $6) AS Arrtime_actual_new,myfuncs.lpad_time((chararray) $7) AS CRSArrtime_new,$8 AS UniqueCarrier,$9 AS FlightNum,$10 AS TailNum_Plane,$11 AS ActualElapsedTime, $12 AS CRSElapsedTime, $13 AS Airtime, $14 AS Arrdelay, $15 AS Depdelay, $16 AS Origin, $17 AS Dest, $18 AS Distance, $19 AS Taxiin, $20 AS Taxiout, $21 AS Cancelled, $22 AS CancellationCode, $23 AS Diverted, $24 AS CarrierDelay, $25 AS WeatherDelay, $26 AS NASDelay, $27 AS SecurityDelay, $28 AS LateAircraftDelay ;
\d B
Output
(2008,1,3,4,0617,0615,0652,0650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA)
Hey @cricket_007, I got it working. I was passing the column fields as bytearray; that was the mistake. When I changed the schema to chararray, it started padding the zero. Thanks a lot.
Please find the corrected records below:
(2008,1,3,4,0617,0615,0652,0650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA)
(2008,1,3,4,0628,0620,0804,0750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA)

Parsing a pretty printed table into Python objects [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Suppose there is a string called likes_and_dislikes, formatted visually as a table, as shown below.
How can I parse the string and return a list of tuples of likes and dislikes? The top header (likes, dislikes) also has to be removed from the list of tuples.
likes_and_dislikes="""
+------------------------------------+-----------------------------------+
| likes | dislikes |
+------------------------------------+-----------------------------------+
| Meritocracy | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration | Ego-driven rhetoric, drama and FUD|
| | to get one's way |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure |
| capable of attracting top-tier | managers compensating for a weak, |
| talent | immature team |
+------------------------------------+-----------------------------------+ """
The key here is to examine the table thoroughly and understand what you are trying to pull out.
First of all, parsing strings like this is generally easier when done line-by-line, so you need to split based on the table rows and then parse columns based on that. We do this primarily because the likes and dislikes span across lines.
1. Getting each row
We don't know how wide the table might be so we use regular expressions to break up our table like so:
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # Drop the header and the tail
This gives us an array that corresponds to our multiline rows. The array slicing at the end removes the header and any trailing whitespace that we don't want to process. However, we still have the problem of pulling together strings that span multiple lines in a cell.
2. Finding a like and dislike
If we iterate through this array of rows, we know each row has a like and a dislike that span an unknown number of lines. We initialise the like and dislike each as an array to make concatenation quicker at the end.
for p in pairs:
    like, dislike = [], []
3. Dealing with each line
With our row, we need to split it based on newlines, then split based on the pipes (|).
    for l in p.split('\n'):
        pair = l.split('|')
4. Pulling out each like and dislike
If the pair we are given has more than one value, then there must be a like and a dislike for us to capture, so append them to our like and dislike arrays - not to likes and dislikes, as those hold our final formatted strings. We also strip each value to remove any leading or trailing whitespace.
        if len(pair) > 1:
            # Not a blank line
            like.append(pair[1].strip())
            dislike.append(pair[2].strip())
5. Creating the final text
Once we are done processing the row we can join the strings with a single space, and can finally add these to our likes and dislikes array.
    if len(like) > 0:
        likes.append(" ".join(like))
    if len(dislike) > 0:
        dislikes.append(" ".join(dislike))
6. Using our new data structure
Now we can use these two new lists however we choose, either printing each list separately...
from pprint import pprint
print "Likes:"
pprint(likes,indent=4)
print "Dislikes:"
pprint(dislikes,indent=4)
... or zip() them together to create a list of paired likes and dislikes!
print "A set of paired likes and dislikes"
pprint(zip(likes,dislikes),indent=4)
The complete code:
likes_and_dislikes="""
+------------------------------------+-----------------------------------+
| likes | dislikes |
+------------------------------------+-----------------------------------+
| Meritocracy | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration | Ego-driven rhetoric, drama and FUD|
| | to get one's way |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure |
| capable of attracting top-tier | managers compensating for a weak, |
| talent | immature team |
+------------------------------------+-----------------------------------+ """
import re

likes, dislikes = [], []
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # Drop the header and the tail
for p in pairs:
    like, dislike = [], []
    for l in p.split('\n'):
        pair = l.split('|')
        if len(pair) > 1:
            # Not a blank line
            like.append(pair[1].strip())
            dislike.append(pair[2].strip())
    if len(like) > 0:
        likes.append(" ".join(like))
    if len(dislike) > 0:
        dislikes.append(" ".join(dislike))

from pprint import pprint
print "Likes:"
pprint(likes, indent=4)
print "Dislikes:"
pprint(dislikes, indent=4)
print "A set of paired likes and dislikes"
pprint(zip(likes, dislikes), indent=4)
This results in:
Likes:
[ 'Meritocracy',
'Healthy debates and collaboration ',
'Autonomy given by confident leaders capable of attracting top-tier talent']
Dislikes:
[ 'Favoritism, ass-kissing, politics',
"Ego-driven rhetoric, drama and FUD to get one's way",
'Micro-management by insecure managers compensating for a weak, immature team']
A set of paired likes and dislikes
[ ('Meritocracy', 'Favoritism, ass-kissing, politics'),
( 'Healthy debates and collaboration ',
"Ego-driven rhetoric, drama and FUD to get one's way"),
( 'Autonomy given by confident leaders capable of attracting top-tier talent',
'Micro-management by insecure managers compensating for a weak, immature team')]
You can see the complete code in action on codepad.
That's (one of) the table formats used in reStructuredText (reST, a pythonic form of markup), and there are various parsers kicking around for it.
Here's one, on the old python.org site: http://legacy.python.org/scripts/ht2html/docutils/parsers/rst/tableparser.py
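For example, a rough, hedged sketch using the docutils grid-table parser; the exact return format is an assumption worth checking against the docutils documentation, and note that a strict reST grid table marks its header row with '=' characters, which the table above lacks:
from docutils.parsers.rst.tableparser import GridTableParser
from docutils.statemachine import StringList

lines = [l.rstrip() for l in likes_and_dislikes.splitlines() if l.strip()]
# parse() is expected to return (column widths, header rows, body rows),
# with each cell as a (morerows, morecols, offset, cell_lines) tuple.
colwidths, head, body = GridTableParser().parse(StringList(lines))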
