Tabulate according to terminal width in python? - python

I have some tabular data with some long fields. Pandas will cut off some of the long fields like this:
            shortname                                              title  \
0                 shc                     Shakespeare His Contemporaries
1  folger-shakespeare           Folger Shakespeare Library Digital Texts
2     perseus-c-greek                            Perseus Canonical Greek
3      stanford-1880s  Adult British Fiction of the 1880s, Assembled ...
4       reuters-21578                                      Reuters-21578
5            ecco-tcp  Eighteenth Century Collections Online / Text C...
    centuries
0  16th, 17th
1  16th, 17th
2         NaN
3         NaN
4         NaN
5        18th
and if I use tabulate.tabulate(), it looks like this:
-  ------------------  --------------------------------------------------------------------------  ----------
0  shc                 Shakespeare His Contemporaries                                              16th, 17th
1  folger-shakespeare  Folger Shakespeare Library Digital Texts                                    16th, 17th
2  perseus-c-greek     Perseus Canonical Greek                                                     nan
3  stanford-1880s      Adult British Fiction of the 1880s, Assembled by the Stanford Literary Lab  nan
4  reuters-21578       Reuters-21578                                                               nan
5  ecco-tcp            Eighteenth Century Collections Online / Text Creation Partnership ECCO-TCP  18th
-  ------------------  --------------------------------------------------------------------------  ----------
In the first case, the width is set to around 80, I'm guessing, and doesn't expand to fill the terminal window. I would like the columns "shortname," "title," and "centuries" to be on the same line, so this doesn't work.
In the second case, the width is set to the width of the data, but that won't work if there's a very long title, and if the user has a smaller terminal window, it will wrap really strangely.
So what I'm looking for is a (preferably easy) way in Python to pretty-print tabular data according to the user's terminal width, or at least allow me to specify the user's terminal width, which I will get elsewhere, like tabulate(data, 120) for 120 columns. Is there a way to do that?

I figured it out with a little poking around the pandas docs. This is what I'm doing now:
table = df[fields]
width = pandas.util.terminal.get_terminal_size() # find the width of the user's terminal window
pandas.set_option('display.width', width[0]) # set that as the max width in Pandas
print(table)

Related

Mapping data frame descriptions based on values of multiple columns

I need to generate a mapping dataframe with each unique code and a description I want prioritised, but need to do it based off a set of prioritisation options. So for example the starting dataframe might look like this:
Filename TB Period Company Code Desc. Amount
0 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 98 100
1 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 7 200
2 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1000 ZX -100
3 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1000 29 -200
4 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1001 BA 100
5 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1001 9 200
6 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 ARC -100
7 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 86 -200
The options I have for prioritisation of descriptions are:
Firstly to search for viable options in each Period, so for example Closing first, then if not found Opening, then if not found Prior.
If multiple descriptions are in the prioritised period, prioritise either longest or first instance.
So for example, if I wanted prioritisation of Closing, then Opening, then Prior, with longest string, I should get a mapping dataframe that looks like this:
Code           New Desc.
FOXTROT__1000  29
FOXTROT__1001  ARC
Just for context, I have a fairly simple way to do all this in tkinter, but it's dependent on generating a GUI of inconsistent codes and comboboxes of their descriptions, which is then used to generate a mapping dataframe.
The issue is that for large volumes (>1000 up to 30,000 inconsistent codes), it becomes impractical to generate a GUI, so for large volumes I need this as a way to auto-generate the mapping dataframe directly from the initial data whilst circumventing tkinter entirely.
import numpy as np
import pandas as pd
# df is the starting dataframe shown above.
# Create a new column which ranks each row by its Period (1 = highest priority)
df['NewFilterColumn'] = np.where(df['Period'] == 'Closing', 1,
                        np.where(df['Period'] == 'Opening', 2,
                        np.where(df['Period'] == 'Prior', 3, None)))
# Sort so that the highest-priority rows come first
df = df.sort_values(by = ['NewFilterColumn', 'Code', 'Desc.'], ascending = True, axis = 0)
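To get from the ranked rows to an actual mapping, one option is to sort within each Code and keep the first row. The sketch below is one way that could look, not a definitive implementation: build_mapping, period_order and prefer are hypothetical names, and it assumes the Period column holds the bare values 'Closing', 'Opening' and 'Prior' as in the snippet above.

```python
import pandas as pd

# Sample rows modelled on the question; the bare Period values are an assumption.
df = pd.DataFrame(
    {
        "Period": ["Prior", "Prior", "Opening", "Closing",
                   "Prior", "Opening", "Closing", "Closing"],
        "Code": ["FOXTROT__1000"] * 4 + ["FOXTROT__1001"] * 4,
        "Desc.": ["98", "7", "ZX", "29", "BA", "9", "ARC", "86"],
    }
)


def build_mapping(frame, period_order=("Closing", "Opening", "Prior"), prefer="longest"):
    """Pick one description per Code (hypothetical helper).

    period_order lists the periods from highest to lowest priority;
    prefer is "longest" or "first" for ties inside the chosen period.
    """
    rank = {period: position for position, period in enumerate(period_order)}
    work = frame.assign(
        _rank=frame["Period"].map(rank),
        _length=frame["Desc."].astype(str).str.len(),
        _position=range(len(frame)),  # original row order, used as tie-breaker
    ).dropna(subset=["_rank"])  # ignore periods that are not in period_order

    sort_columns, ascending = ["Code", "_rank"], [True, True]
    if prefer == "longest":
        sort_columns.append("_length")
        ascending.append(False)  # longest description first
    sort_columns.append("_position")
    ascending.append(True)

    # After sorting, the first row of each Code is the preferred one
    best = work.sort_values(sort_columns, ascending=ascending).drop_duplicates("Code")
    return (
        best[["Code", "Desc."]]
        .rename(columns={"Desc.": "New Desc."})
        .reset_index(drop=True)
    )


print(build_mapping(df))
```

Passing a different period_order, or prefer="first", covers the other prioritisation options without touching the GUI.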

Tabula py not reading all rows for PDFs with alternating colors for each row when Lattice is set to True

I am trying to extract all rows from the PDF attached here.
Here is the code I used:
import pandas as pd
from tabula import read_pdf
def parse_latticepdf_pages(pdf):
    pages = read_pdf(
        pdf,
        pages = "all",
        guess = False,
        lattice = True,
        silent = True,
        area = [43, 5, 568, 774],
        pandas_options = {'header': None}
    )
    # read_pdf returns a list of dataframes; combine them into one
    return pd.concat(pages)
parse_latticepdf_pages(pdf = "file.pdf")
The output shows only the rows with the grey background color. It doesn't show the rows with the white background color. How do I get all rows regardless of their background color?
Note: Initially I tried with stream = True, but that caused other problems: each line appears as a separate row, and it is impossible to group the rows as needed. Hence, I set lattice = True. Also, enabling or disabling multiple_tables produces the same issue.
I would appreciate any help regarding this. Thank you!
I'm not sure what's happening, but I confirmed that it works with the multiple_tables=False option, as follows:
In [41]: tabula.read_pdf(fname, pages=1, lattice=True, area = [43, 5, 568, 774], multiple_tables=False)
Out[41]:
[ Issued Date Permit No. ... Proposed Use Valuation
0 4/1/2019 P025361-032119 ... New office and restroom addition to existing\r... $45,000.00
1 4/12/2019 P025502-041219 ... Isolate chapel from fire damaged area 4000 sq.... $1,000.00
2 4/12/2019 P025487-041019 ... Interior finish-out for new meat market 2500\r... $35,000.00
3 4/15/2019 P025520-041519 ... New 8-unit apartment building 10,800 sq. ft. $350,000.00
4 4/25/2019 P025101-020719 ... New Five Story Hotel 93,501 sq. ft. $12,327,000.00
5 4/9/2019 P025475-040919 ... Mobile Home Placement 1216 sq. ft. $1,250.00
6 4/9/2019 P025477-040919 ... Mobile Home Placement 1216 sq. ft. $1,250.00
7 4/9/2019 P025479-040919 ... Mobile Home Placement 1216 sq. ft. $1,250.00
8 4/8/2019 P025459-040519 ... Build a carport. $1,000.00
[9 rows x 7 columns]]
It might cause another issue with pages="all", though.
I finally managed to solve this. For this particular PDF format, it's better to use another Python package such as PyMuPDF. I posted a similar question in another Stack Overflow post, and I am linking it here in the hope that it helps others struggling with a similar problem:
Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) - text positioned in the middle for each row

Pandas.read_csv issue

I am trying to read the messages from my dataset, but the values under the class label are not read as they appear in the CSV file.
messages = pandas.read_csv('bitcoin_reddit.csv', delimiter='\t',
                           names=["title","class"])
print(messages)
Under the class label, pandas only reads NaN.
A sample of my CSV file:
title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
Everyone who's trading BTC right now,http://cdn.mutually.com/wp-content/uploads/2017/06/08-19.jpg,7/1/2018 12:38,1
I hope James is doing well,https://i.redd.it/h4ngqma643101.jpg,1/12/2017 1:50,1
Weeeeeeee!,https://i.redd.it/iwl7vz69cea01.gif,17/1/2018 1:13,0
Bitcoin.. The King,https://i.redd.it/4tl0oustqed01.jpg,1/2/2018 5:46,1
Nothing can increase by that much and still be a good investment.,https://i.imgur.com/oWePY7q.jpg,14/12/2017 0:02,1
"This is why I want bitcoin to hit $10,000",https://i.redd.it/fhzsxgcv9nyz.jpg,18/11/2017 18:25,1
Bitcoin Doesn't Give a Fuck.,https://v.redd.it/ty2y74gawug01,18/2/2018 15:19,-1
Working Hard or Hardly Working?,https://i.redd.it/c2o6204tvc301.jpg,12/12/2017 12:49,1
The separator in your csv file is a comma, not a tab. And since , is the default, there is no need to define it.
However, names= defines custom names for the columns. Your header already provides these names, so passing the column names you are interested in to usecols is all you need then:
>>> pd.read_csv(file, usecols=['title', 'class'])
                                    title  class
0  It's official! 1 Bitcoin = $10,000 USD      0
1        The last 3 months in 47 seconds.      0
2                       It's over 9000!!!      1
3    Everyone who's trading BTC right now      1
4              I hope James is doing well      1
5                              Weeeeeeee!      0
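To see why the original call produced NaN, it may help to run both variants side by side. This is a small sketch using three rows copied from the question and held in memory with io.StringIO instead of the real file; the variable names broken and messages are only illustrative.

```python
import io

import pandas as pd

# A few rows copied from the question, held in memory instead of a file.
csv_text = '''title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
'''

# Tab delimiter: the file contains no tabs, so every line is one single field.
# The first name ("title") receives the whole line and "class" stays empty.
broken = pd.read_csv(io.StringIO(csv_text), delimiter="\t", names=["title", "class"])
print(broken["class"].isna().all())

# Default comma delimiter, header taken from the file, only two columns kept.
messages = pd.read_csv(io.StringIO(csv_text), usecols=["title", "class"])
print(messages)
```

Note that the quoted title keeps its embedded comma ($10,000) because the default parser honours the double quotes.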

filtering a column of string by list without doing exact match

I have a pandas data frame like the one below:
Tweets
0 RT #cizzorz: THE CHILLER TRAP *TEMPLE RUN* OBS...
1 Disco Domination receives a change in order to...
2 It's time for the Week 3 #FallSkirmish Trials!...
3 Dance your way to victory in the new Disco Dom...
4 Patch v6.02 is available now with a return fro...
5 Downtime for patch v6.02 has begun. Find out a...
6 💀⛏️... soon
7 Launch into patch v6.02 Wednesday, October 10!...
8 Righteous Fury.\n\nThe Wukong and Dark Vanguar...
9 RT #wbgames: WB Games is happy to bring #Fortn...
I also have a list like the one below:
my_list = ['Launch', 'Dance', 'Issue']
The command below filters the dataframe:
ndata = data[data['Tweets'].str.contains( "|".join(my_list), regex=True)].reset_index(drop=True)
The filter does not work for variants of a word such as these:
Working    Not working
Launch     'launch', 'launch,', 'Launch,', 'LAUNCH', '#launch'
The expected output should be every sentence containing any of the words below:
'launch', 'launch,', 'Launch,', 'LAUNCH', '#launch'
You need to make sure that contains ignores case:
import re
.
.
.
ndata = data[data['Tweets'].str.contains("|".join(my_list), regex=True,
                                         flags=re.IGNORECASE)].reset_index(drop=True)
#                                        ^^^^^^^^^^^^^^^^^^^
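A self-contained version of the same filter might look like the sketch below. The sample tweets are made up for illustration; re.escape and na=False are additions of mine that guard against regex metacharacters in the word list and against missing values, and case=False is the str.contains shorthand for a case-insensitive match.

```python
import re

import pandas as pd

# Made-up sample tweets covering the variants from the question.
data = pd.DataFrame(
    {
        "Tweets": [
            "Launch into patch v6.02 Wednesday, October 10!",
            "LAUNCH day is here",
            "See you at the #launch, everyone",
            "Dance your way to victory",
            "Downtime for patch v6.02 has begun.",
        ]
    }
)
my_list = ["Launch", "Dance", "Issue"]

# re.escape keeps characters such as '.' or '+' in a word from acting as regex syntax
pattern = "|".join(re.escape(word) for word in my_list)

# case=False makes the match case-insensitive; na=False treats missing tweets as non-matching
mask = data["Tweets"].str.contains(pattern, case=False, regex=True, na=False)
ndata = data[mask].reset_index(drop=True)
print(ndata)
```

Because contains looks for the pattern anywhere in the string, 'launch,' and '#launch' match without any extra handling.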

Python 2.7 - pandas.read_table - how to import quadruple-pipe-separated fields from flat file

I am a decent SAS programmer, but I am quite new in Python. Now, I have been given Twitter feeds, each saved into very large flat files, with headers in row #1 and a data structure like the below:
CREATED_AT||||ID||||TEXT||||IN_REPLY_TO_USER_ID||||NAME||||SCREEN_NAME||||DESCRIPTION||||FOLLOWERS_COUNT||||TIME_ZONE||||QUOTE_COUNT||||REPLY_COUNT||||RETWEET_COUNT||||FAVORITE_COUNT
Tue Nov 14 12:33:00 +0000 2017||||930413253766791168||||ICYMI: Football clubs join the craft beer revolution! A good read|||| ||||BAB||||BABBrewers||||Monthly homebrew meet-up at 1000 Trades, Jewellery Quarter. First Tuesday of the month. All welcome, even if you've never brewed before.||||95|||| ||||0||||0||||0||||0
Tue Nov 14 12:34:00 +0000 2017||||930413253766821456||||I'm up for it|||| ||||Misty||||MistyGrl||||You CAN DO it!||||45|||| ||||0||||0||||0||||0
I guess it's like that because any sort of characters can be found in a Twitter feed, but a quadruple pipe is unlikely enough.
I know some people use JSON for that, but I've got these files as such: lots of them. I could use SAS to easily transform these files, but I prefer to "go pythonic", this time.
Now, I cannot seem to find a way to make Python (2.7) understand that the quadruple pipe is the actual separator. The output from the code below:
import pandas as pd
with open('C:/Users/myname.mysurname/Desktop/my_twitter_flow_1.txt') as theInFile:
    inTbl = pd.read_table(theInFile, engine='python', sep='||||', header=1)
    print inTbl.head()
seems to suggest that Python does not treat the fields as distinct but simply brings in each of the first 5 rows, up to the line feed character, ignoring the |||| separator.
Basically, I am getting an output like the one I wrote above to show you the data structure.
Any hints?
Using just the data in your question:
>>> df = pd.read_csv('rio.txt', sep=r'\|{4}', skip_blank_lines=True, engine='python')
>>> df
                       CREATED_AT                  ID  \
0  Tue Nov 14 12:33:00 +0000 2017  930413253766791168
1  Tue Nov 14 12:34:00 +0000 2017  930413253766821456
                                                TEXT IN_REPLY_TO_USER_ID  \
0  ICYMI: Football clubs join the craft beer revo...
1                                      I'm up for it
    NAME SCREEN_NAME                                        DESCRIPTION  \
0    BAB  BABBrewers  Monthly homebrew meet-up at 1000 Trades, Jewel...
1  Misty    MistyGrl                                     You CAN DO it!
   FOLLOWERS_COUNT TIME_ZONE  QUOTE_COUNT  REPLY_COUNT  RETWEET_COUNT  \
0               95                      0            0              0
1               45                      0            0              0
   FAVORITE_COUNT
0               0
1               0
Notice the sep parameter. When it is more than one character long and not equal to '\s+', it is interpreted as a regular expression. The '|' character has a special meaning in a regex (alternation), so it must be escaped with the '\' character. I could simply have written sep=r'\|\|\|\|'; however, I've used the shorter quantifier form, \|{4}. A raw string (the r prefix) keeps Python from treating the backslash as a string escape.
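If you would rather not escape the pipes by hand, re.escape can build the pattern for you. The sketch below assumes Python 3 and uses a trimmed, in-memory sample in the question's layout (only five of the columns), so the variable names raw and separator are purely illustrative.

```python
import io
import re

import pandas as pd

# Trimmed sample in the question's layout (only five of the columns).
raw = (
    "CREATED_AT||||ID||||TEXT||||SCREEN_NAME||||FOLLOWERS_COUNT\n"
    "Tue Nov 14 12:33:00 +0000 2017||||930413253766791168||||"
    "ICYMI: Football clubs join the craft beer revolution! A good read||||BABBrewers||||95\n"
    "Tue Nov 14 12:34:00 +0000 2017||||930413253766821456||||I'm up for it||||MistyGrl||||45\n"
)

# re.escape puts a backslash in front of each pipe, so the regex matches
# four literal '|' characters instead of an alternation of empty strings.
separator = re.escape("||||")

# Regex separators require the python parsing engine.
df = pd.read_csv(io.StringIO(raw), sep=separator, engine="python")
print(df.columns.tolist())
print(df["FOLLOWERS_COUNT"].sum())
```

The unescaped '||||' is a valid regex too, which is why the original call raised no error: it is an alternation of empty patterns, so it never matches the literal pipes.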
