How to align strings in columns? - python

I am trying to print out a custom format but am facing an issue.
header = ['string', 'longer string', 'str']
header1, header2, header3 = header
data = ['string', 'str', 'longest string']
data1, data2, data3 = data
len1 = len(header1)
len2 = len(header2)
len3 = len(header3)
len_1 = len(data1)
len_2 = len(data2)
len_3 = len(data3)
un = len1 + len2 + len3 + len_1 + len_2 + len_3
un_c = '_' * un
print(f"{un_c}\n|{header1} |{header2} |{header3}| \n |{data1} |{data2} |{data3}|")
Output:
_____________________________________________
|string |longer string |str|
|string |str |longest string|
The output I want is this:
_______________________________________
|string |longer string |str           |
|string |str           |longest string|
I want it to work for strings of any length, using len to add extra spacing to each string so that everything lines up, but I can't figure it out at all.

There is a package called tabulate that is very good for this (https://pypi.org/project/tabulate/). There is a similar post here.
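For instance, a minimal sketch (the tablefmt choice is just one of several built-in styles; tabulate measures the columns and pads the cells for you):
from tabulate import tabulate

header = ['string', 'longer string', 'str']
rows = [['string', 'str', 'longest string']]

# tabulate pads each column to its widest cell automatically
print(tabulate(rows, headers=header, tablefmt='github'))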

Each cell is padded out to the longest content in its column, with extra spaces making up any shortfall. A | is printed at the start of each line, and the remaining | separators are produced through the end parameter of print.
The content is placed in a nested list to make looping easy; other arrangements are possible, the principle is the same, and adding more content does not affect it.
items = [
    ['string', 'longer string', 'str'],
    ['string', 'str', 'longest string'],
    ['longer string', 'str', 'longest string'],
]
length = [max([len(item[i]) for item in items]) for i in range(len(items[0]))]
max_length = sum(length)
print("_" * (max_length + 4))
for item in items:
    print("|", end="")
    for i in range(len(length)):
        item_length = len(item[i])
        if length[i] > item_length:
            print(item[i] + " " * (length[i] - item_length), end="|")
        else:
            print(item[i], end="|")
    print()
OUTPUT:
____________________________________________
|string       |longer string|str           |
|string       |str          |longest string|
|longer string|str          |longest string|

Do it in two parts. First, figure out the size of each column. Then, do the printing based on those sizes.
header = ['string', 'longer string', 'str']
data = ['string', 'str', 'longest string']
lines = [header] * 3 + [data] * 3

def getsizes(lines):
    maxn = [0] * len(lines[0])
    for row in lines:
        for i, col in enumerate(row):
            maxn[i] = max(maxn[i], len(col) + 1)
    return maxn

def maketable(lines):
    sizes = getsizes(lines)
    total = sum(sizes)
    print('_' * (total + len(sizes)))
    for row in lines:
        print('|', end='')
        for width, col in zip(sizes, row):
            print(col.ljust(width), end='|')
        print()

maketable(lines)
Output:
_______________________________________
|string |longer string |str            |
|string |longer string |str            |
|string |longer string |str            |
|string |str           |longest string |
|string |str           |longest string |
|string |str           |longest string |
You could change it to build up a single string, if you need that.
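For instance, a minimal sketch of that variant, reusing the same getsizes helper and returning one string instead of printing:
def maketable_str(lines):
    sizes = getsizes(lines)
    out = ['_' * (sum(sizes) + len(sizes))]
    for row in lines:
        # same layout as maketable, accumulated instead of printed
        out.append('|' + '|'.join(col.ljust(w) for w, col in zip(sizes, row)) + '|')
    return '\n'.join(out)

print(maketable_str(lines))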

This accepts an arbitrary number of rows; each row is assumed to contain string-type entries.
def table(*rows, padding=2, sep='|'):
    sep_middle = ' ' * (padding // 2) + sep + ' ' * (padding // 2)
    template = '{{:{}}}'
    col_sizes = [max(map(len, col)) for col in zip(*rows)]
    table_template = sep_middle.join(map(template.format, col_sizes))
    print('_' * (sum(col_sizes) + len(sep_middle) * (len(col_sizes) - 1)
                 + 2 * len(sep) + 2 * (len(sep) * padding // 2)))
    # the first row passed in serves as the header
    for line in rows:
        print(sep + ' ' * (padding // 2) + table_template.format(*line) + ' ' * (padding // 2) + sep)

header = ['string', 'longer string', 'str', '21']
data1 = ['string', 'str', 'longest stringhfykhj', 'null']
data2 = ['this', 'is', 'a', 'test']

# test 1
table(header, data1, data2)

# test 2
table(header, data1, data2, padding=4, sep=':')
Output:
# first test
________________________________________________________
| string | longer string | str                  | 21   |
| string | str           | longest stringhfykhj | null |
| this   | is            | a                    | test |
# second test
________________________________________________________________
:  string  :  longer string  :  str                   :  21    :
:  string  :  str            :  longest stringhfykhj  :  null  :
:  this    :  is             :  a                     :  test  :
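A note on the template = '{{:{}}}' trick above: the doubled braces survive the first .format call, which injects the column width and produces an ordinary format spec. A quick demonstration:
# '{{:{}}}'.format(13) -> '{:13}', which left-pads a string to width 13
cell = '{{:{}}}'.format(13).format('str')
print(repr(cell))  # 'str          '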

Related

Adding a newline to the printed data from an imported file

I made the following code, which imports a file and prints its content:
import pandas as pd

file = r"..\test.xlsx"
try:
    df = pd.read_excel(file)
    #print(df)
except OSError:
    print("Impossible to read", file)

test = df['Date'].map(str) + ' | ' \
    + df['Time'].map(str) + ' | ' \
    + df['Description'].map(str) + ' | ' \
    + '\n'
print(test)
The output is (Edit: I should specify that it is printed into an HTML file):
20/01 | 17:00 | Text description here1 17/01 | 11:00 | Text description here2 16/01 | 16:32 | Text description here3 <- in orange when "Urgence" is equal to 3
But what I want is :
20/01 | 17:00 | Text description here1
17/01 | 11:00 | Text description here2
16/01 | 16:32 | Text description here3
I added a newline at the end of my statement (+ '\n') but it doesn't seem to change anything. How should I proceed? Thank you.
Edit: I believe the problem comes from the fact that the entire file is printed at once rather than line by line, so the newline is not added to each line. So I made this code:
test = []
for index, row in df.iterrows():
    x = row['Date'] + ' | ' + row['Description'] + '\n'
    test.append(x)
print(test)
But the result is the same.
Try this:
test = df['Date'].map(str) + ' | ' + \
       df['Time'].map(str) + ' | ' + \
       df['Description'].map(str) + ' | '

list(map(lambda x: print(x), test))
I removed the end of the test string and added the print function.
Let me know if there is any problem :)
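One caveat, since the question's edit says the result ends up in an HTML file: browsers collapse plain newlines, so even a correctly placed '\n' will render as a single line there. In that case an HTML line break is needed instead; a minimal sketch of that idea (the '<br>' joining is my assumption about the rendering step):
# build one string per row, then join the rows with HTML line breaks
rows = df['Date'].map(str) + ' | ' + df['Time'].map(str) + ' | ' + df['Description'].map(str)
html_out = '<br>\n'.join(rows)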

Parse ascii table header

So I need to parse this into a dataframe or list:
tmp = ['+--------------+-----------------------------------------+',
       '| Something to | Some header with subheader              |',
       '| watch or     +-----------------+-----------------------+',
       '| idk          | First           | another text again    |',
       '|              |                 | with one more line    |',
       '|              |                 +-----------------------+',
       '|              |                 | and this  | how it be |',
       '+--------------+-----------------+-----------------------+']
It is just a txt table with a strange header. I need to transform it to this:
['Something to watch or idk', 'Some header with subheader First', 'Some header with subheader another text again with one more line and this', 'Some header with subheader another text again with one more line how it be']
Here's my first solution that gets me closer to victory (you can see my tries in the comments):
import re

pluses = [i for i, element in enumerate(tmp) if element[0] == '+']
tmp2 = tmp[pluses[0]:pluses[1] + 1].copy()
table_str = ''.join(tmp[pluses[0]:pluses[1] + 1])
col = [[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
tmp3 = []
strt = ''.join(tmp2.copy())
table_list = [l.strip().replace('\n', '') for l in re.split(r'\+[+-]+', strt) if l.strip()]
for row in table_list:
    joined_row = ['' for _ in range(len(row))]
    for lines in [line for line in row.split('||')]:
        line_part = [i.strip() for i in lines.split('|') if i]
        joined_row = [i + j for i, j in zip(joined_row, line_part)]
    tmp3.append(joined_row)
Here's the output:
tmp3
Out[4]:
[['Something to', 'Some header with subheader'],
 ['Something towatch or'],
 ['idk', 'First', 'another text again'],
 ['idk', 'First', 'another text againwith one more line'],
 ['idk'],
 ['', '', 'and this', 'how it be']]
All that remains is to join this the right way, but idk how to...
Here's an addon:
We can locate the pluses and splitters with this:
col=[[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
[[0, 15, 57],
[0, 15, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 45, 57],
[0, 15, 33, 57]]
And then we could split or group by cell, but idk how to do that either... Please help.
Example No.2:
+----------+------------------------------------------------------------+---------------+----------------------------------+--------------------+-----------------------+
| Number   | longtextveryveryloooooong                                  | aaaaaaaaaaa   | bbbbbbbbbbbbbbbbbb               | dfsdfgsdfddd       |qqqqqqqqqqqqqqqqqqqqqq |
| string   |                                                            |               | ccccccccccccccccccccc            | affasdd as         |qqqqqqqqqqqqqqqqqqqqqq |
|          |                                                            |               | eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee,| seeerrrr e,        | dfsdfffffffffffff     |
|          |                                                            |               | anothertext and something        | percent            | ttttttttttttttttt     |
|          |                                                            |               | (nothingtodo),                   |                    | sssssssssssssssssssss |
|          |                                                            |               | and text                         |                    |zzzzzzzzzzzzzzzzzzzzzz |
|          |                                                            |               +----------------------------------+                    | b rererereerr ppppppp |
|          |                                                            |               | all              | longtext wit-  |                    |                       |
|          |                                                            |               |                  |h many character|                    |                       |
+----------+------------------------------------------------------------+---------------+-----------------+----------------+--------------------+-----------------------+
You could do it recursively - parsing each "sub table" at a time:
def parse_table(table, header='', root='', table_len=None):
    # store length of original table
    if not table_len:
        table_len = len(table)
    # end of current "column"
    col = table[0].find('+', 1)
    rows = [
        row for row in range(1, len(table))
        if table[row].startswith('+') and table[row][col] == '+'
    ]
    row = rows[0]
    # split "line" contents into columns
    # end of "line" is either `+` or final `|`
    end = col
    num_cols = table[0].count('+')
    if num_cols != table[1].count('|'):
        end = table[1].rfind('|')
    columns = (line[1:end].split('|') for line in table[1:row])
    # rebuild each column appending to header
    content = [
        ' '.join([header] + [line.strip() for line in lines]).strip()
        for lines in zip(*columns)
    ]
    # is there a table below?
    if row + 2 < len(table):
        header = content[-1]
        # if we are not the last table - we are a header
        if len(rows) > 1:
            header = content.pop()
        # if we are the first table in column - we are the root
        if not root:
            root = header
        next_table = [line[:col + 1] for line in table[row:]]
        content.extend(
            parse_table(next_table, header=header, root=root, table_len=table_len)
        )
    # is there a table to the right?
    if col + 2 < len(table[0]):
        # find start line of next table
        row = next(
            row for row, line in enumerate(table, start=-1)
            if line[col] == '|'
        )
        next_table = [line[col:] for line in table[row:]]
        # new top-level table - reset root
        if len(next_table) == table_len:
            root = ''
        # next table on same level - reset header
        if len(table) == len(next_table):
            header = root
        content.extend(
            parse_table(next_table, header=header, root=root, table_len=table_len)
        )
    return content
Output:
>>> parse_table(table)
['Something to watch or idk',
'Some header with subheader First',
'Some header with subheader another text again with one more line and this',
'Some header with subheader another text again with one more line how it be']
>>> parse_table(big_table)
['Number string',
'longtextveryveryloooooong',
'aaaaaaaaaaa',
'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text all',
'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text longtext wit- h many character',
'dfsdfgsdfddd affasdd as seeerrrr e, percent',
'qqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqq dfsdfffffffffffff ttttttttttttttttt sssssssssssssssssssss zzzzzzzzzzzzzzzzzzzzzz b rererereerr ppppppp']
>>> parse_table(planets)
['Planets Planet Sun (Solar) Earth Moon Mars',
'Planets R (km) 696000 6371 1737 3390',
'Planets mass (x 10^29 kg) 1989100000 5973.6 73.5 641.85']
As the input is in the format of a reStructuredText table, you could use the docutils table parser.
import docutils.parsers.rst.tableparser
import docutils.statemachine
from collections.abc import Iterable

def extract_texts(tds):
    "recursively extract StringLists and join"
    texts = []
    for e in tds:
        if isinstance(e, docutils.statemachine.StringList):
            texts.append(' '.join([s.strip() for s in list(e) if s]))
            break
        if isinstance(e, Iterable):
            texts.append(extract_texts(e))
    return texts
>>> parser = docutils.parsers.rst.tableparser.GridTableParser()
>>> tds = parser.parse(docutils.statemachine.StringList(tmp))
>>> extract_texts(tds)
[[],
[],
[[['Something to watch or idk'], ['Some header with subheader']],
[['First'], ['another text again with one more line']],
[['and this | how it be']]]]
then flatten.
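A minimal sketch of that flattening step, assuming the nested-list output shown above:
def flatten(nested):
    # yield leaf strings from arbitrarily nested lists, skipping empties
    for e in nested:
        if isinstance(e, list):
            yield from flatten(e)
        elif e:
            yield e

print(list(flatten(extract_texts(tds))))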
For more general usage, it is worth having a look at tds (the structure returned by parse); there is some documentation there.

Flatten nested array in Spark DataFrame

I'm reading in some JSON of the form:
{"a": [{"b": {"c": 1, "d": 2}}]}
That is, the array items are unnecessarily nested. Now, because this happens inside an array, the answers given in How to flatten a struct in a Spark dataframe? don't apply directly.
This is how the dataframe looks when parsed:
root
 |-- a: array
 |    |-- element: struct
 |    |    |-- b: struct
 |    |    |    |-- c: integer
 |    |    |    |-- d: integer
I'm looking to transform the dataframe into this:
root
 |-- a: array
 |    |-- element: struct
 |    |    |-- b_c: integer
 |    |    |-- b_d: integer
How do I go about aliasing the columns inside the array to effectively unnest it?
You can use transform:
df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
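For what it's worth, the same thing can be written without the SQL string via the DataFrame API's transform function (available in pyspark.sql.functions since Spark 3.1; the column names are taken from the question):
from pyspark.sql import functions as F

df2 = df.withColumn(
    "a",
    F.transform(
        "a",
        # rebuild each array element as a struct of flat fields
        lambda x: F.struct(
            x["b"]["c"].alias("b_c"),
            x["b"]["d"].alias("b_d"),
        ),
    ),
)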
Using the method presented in the accepted answer, I wrote a function to recursively unnest a dataframe (recursing into nested arrays as well):
from pyspark.sql.types import ArrayType, StructType

def flatten(df, sentinel="x"):
    def _gen_flatten_expr(schema, indent, parents, last, transform=False):
        def handle(field, last):
            path = parents + (field.name,)
            alias = (
                " as "
                + "_".join(path[1:] if transform else path)
                + ("," if not last else "")
            )
            if isinstance(field.dataType, StructType):
                yield from _gen_flatten_expr(
                    field.dataType, indent, path, last, transform
                )
            elif (
                isinstance(field.dataType, ArrayType)
                and isinstance(field.dataType.elementType, StructType)
            ):
                yield indent, "transform("
                yield indent + 1, ".".join(path) + ","
                yield indent + 1, sentinel + " -> struct("
                yield from _gen_flatten_expr(
                    field.dataType.elementType, indent + 2, (sentinel,), True, True
                )
                yield indent + 1, ")"
                yield indent, ")" + alias
            else:
                yield (indent, ".".join(path) + alias)

        try:
            *fields, last_field = schema.fields
        except ValueError:
            pass
        else:
            for field in fields:
                yield from handle(field, False)
            yield from handle(last_field, last)

    lines = []
    for indent, line in _gen_flatten_expr(df.schema, 0, (), True):
        spaces = " " * 4 * indent
        lines.append(spaces + line)

    expr = "struct(" + "\n".join(lines) + ") as " + sentinel
    return df.selectExpr(expr).select(sentinel + ".*")
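A quick usage sketch against the JSON from the question (the spark/sc session objects are assumed; note that JSON numbers load as long rather than integer):
df = spark.read.json(sc.parallelize(['{"a": [{"b": {"c": 1, "d": 2}}]}']))
flatten(df).printSchema()
# root
#  |-- a: array
#  |    |-- element: struct
#  |    |    |-- b_c: long
#  |    |    |-- b_d: long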
Simplified Approach:
from pyspark.sql.functions import col

def flatten_df(nested_df):
    stack = [((), nested_df)]
    columns = []
    while len(stack) > 0:
        parents, df = stack.pop()
        flat_cols = [
            col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]
        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]
        columns.extend(flat_cols)
        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))
    return nested_df.select(columns)
ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/how-to-analyze-complex-schema
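Note that this simplified version only walks struct columns: an array column such as a in the question has dtype array<struct<...>>, which does not start with "struct", so it passes through untouched. A hypothetical usage sketch:
# structs at any depth become 'parent_child' columns;
# array columns pass through as-is
flat_df = flatten_df(df)
flat_df.printSchema()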

Translating an EBNF grammar to pyparsing gives an error

I am making a parser to convert a simple DSL into an Elasticsearch query. Some of the possible queries are:
response:success
response:success AND extension:php OR extension:css
response:sucess AND (extension:php OR extension:css)
time >= 2020-01-09
time >= 2020-01-09 AND response:success OR os:windows
NOT reponse:success
response:success AND NOT os:windows
I have written the following EBNF grammar for this:
<expr> ::= <or>
<or> ::= <and> (" OR " <and>)*
<and> ::= <unary> ((" AND ") <unary>)*
<unary> ::= " NOT " <unary> | <equality>
<equality> ::= (<word> ":" <word>) | <comparison>
<comparison> ::= "(" <expr> ")" | (<word> (" > " | " >= " | " < " | " <= ") <word>)+
<word> ::= ("a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n"
| "o" | "p" | "q" | "r" | "s" | "t" | "u"
| "v" | "w" | "x" | "y" | "z")+
The precedence of operators in the DSL is:
() > NOT > AND > OR
Also, exact matching, i.e. ':', has higher precedence than the comparison operators.
I believe the above grammar captures the idea of my DSL. I am having a difficult time translating it to pyparsing; this is what I have now:
from pyparsing import *
AND = Keyword('AND') | Keyword('and')
OR = Keyword('OR') | Keyword('or')
NOT = Keyword('NOT') | Keyword('not')
word = Word(printables, excludeChars=':')
expr = Forward()
expr << Or
Comparison = Literal('(') + expr + Literal(')') + OneOrMore(word + ( Literal('>') | Literal('>=') | Literal('<') | Literal('<=')) + word)
Equality = (word + Literal(':') + word) | Comparison
Unary = Forward()
Unary << (NOT + Unary) | Equality
And = Unary + ZeroOrMore(AND + Unary)
Or = And + ZeroOrMore(OR + And)
The error I get is:
Traceback (most recent call last):
  File "qql.py", line 54, in <module>
    expr << Or
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyparsing.py", line 5006, in __lshift__
    self.mayIndexError = self.expr.mayIndexError
AttributeError: type object 'Or' has no attribute 'mayIndexError'
I think it's because I am unable to understand Forward() correctly.
Question: how can I correctly translate the above grammar to pyparsing?
EDIT: when I changed the pyparsing code to:
AND = Keyword('AND')
OR = Keyword('OR')
NOT = Keyword('NOT')
word = Word(printables, excludeChars=':')
expr = Forward()
Comparison = Literal('(') + expr + Literal(')') + OneOrMore(word + ( Literal('>') | Literal('>=') | Literal('<') | Literal('<=')) + word)
Equality = (word + Literal(':') + word) | Comparison
Unary = Forward()
Unary << ((NOT + Unary) | Equality)
And = Unary + ZeroOrMore(AND) + Unary
Or = And + ZeroOrMore(OR + And)
expr << Or
Q = """response : 200 \
AND extesnion: php \
OR extension: css \
"""
print(expr.parseString(Q))
I get this output:
['response', ':', '200', 'AND', 'extesnion', ':', 'php']
Why is the OR expression not parsed?
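A likely reason: in the edit, And = Unary + ZeroOrMore(AND) + Unary always demands a trailing Unary, so when OR + And reaches the final extension: css clause there is no second Unary left, the branch fails, and ZeroOrMore quietly matches nothing; parseString then stops at that point unless parseAll=True is passed. A minimal sketch using pyparsing's infixNotation helper, which encodes the () > NOT > AND > OR precedence directly (the character set for word is an assumption):
from pyparsing import (Keyword, Literal, Word, alphanums, infixNotation,
                       oneOf, opAssoc)

AND = Keyword('AND') | Keyword('and')
OR = Keyword('OR') | Keyword('or')
NOT = Keyword('NOT') | Keyword('not')

# assumed charset: letters, digits, and -_. so dates like 2020-01-09 parse
word = Word(alphanums + '-_.')
# ':' binds tighter than comparisons, so try it first
term = (word + Literal(':') + word) | (word + oneOf('>= <= > <') + word)

expr = infixNotation(term, [
    (NOT, 1, opAssoc.RIGHT),  # NOT binds tighter than AND
    (AND, 2, opAssoc.LEFT),
    (OR, 2, opAssoc.LEFT),    # OR binds loosest
])

print(expr.parseString('response:success AND (extension:php OR extension:css)',
                       parseAll=True))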

Apache beam CombinePerKey(sum) is not summing correctly

I want to count the occurrences of every value of the field 'size' in my data:
counts = (
    lines
    | 'convert_to_dict' >> beam.Map(process_data)
    | 'Window' >> beam.WindowInto(window.FixedWindows(1 * 1),
                                  accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | 'GettinyCounty' >> beam.Map(lambda x: x['size'])
    | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
    | 'Sum' >> beam.CombinePerKey(sum))
Printing the results to the console:
def format_result(size_count):
    (size, count) = size_count
    out = {}
    out["size"] = size
    out["count"] = count
    print("{}".format(out))
    return str(out)

output = (
    counts
    | 'format_result' >> beam.Map(format_result)
    | 'encode' >> beam.Map(lambda x: x.encode('utf-8')).with_output_types(bytes))

output | beam.io.WriteToPubSub(known_args.output_topic)
Here's what I get. I expected something like:
{'medium': 155}, {'small': 75}, ...
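A hedged guess at what is happening, based on the pipeline shown: with window.FixedWindows(1) and AccumulationMode.DISCARDING, every one-second window is summed independently and each firing resets the count, so instead of one total per size you get many small partial sums. A sketch with a wider window and accumulating panes (the 60s window and 10s trigger values are placeholder assumptions):
import apache_beam as beam
from apache_beam.transforms import trigger, window

counts = (
    lines
    | 'convert_to_dict' >> beam.Map(process_data)
    | 'GetSize' >> beam.Map(lambda x: x['size'])
    | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
    | 'Window' >> beam.WindowInto(
        window.FixedWindows(60),  # sum over a minute instead of a second
        trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
    | 'Sum' >> beam.CombinePerKey(sum))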
