I wrote the following code, which imports a file and prints its contents:
import pandas as pd

file = r"..\test.xlsx"
try:
    df = pd.read_excel(file)
    # print(df)
except OSError:
    print("Impossible to read", file)
test = df['Date'].map(str) + ' | ' \
    + df['Time'].map(str) + ' | ' \
    + df['Description'].map(str) + ' | ' \
    + '\n'
print(test)
The output is (edit: to be precise, it is printed in an HTML file):
20/01 | 17:00 | Text description here1 17/01 | 11:00 | Text description here2 16/01 | 16:32 | Text description here3 <- in orange when the "Urgence" field is equal to 3
But what I want is:
20/01 | 17:00 | Text description here1
17/01 | 11:00 | Text description here2
16/01 | 16:32 | Text description here3
I added a newline at the end of my statement (+ '\n'), but it doesn't seem to change anything. How should I proceed? Thank you.
Edit: I believe the problem comes from the fact that the entire file is printed at once rather than line by line, so the newline is not added to each line. So I wrote this code:
test = []
for index, row in df.iterrows():
    x = row['Date'] + ' | ' + row['Description'] + '\n'
    test.append(x)
print(test)
But the result is the same.
Try this:
test = df['Date'].map(str) + ' | ' \
    + df['Time'].map(str) + ' | ' \
    + df['Description'].map(str) + ' | '
list(map(lambda x: print(x), test))
I removed the trailing '\n' from the test string and printed each element with print instead.
Let me know if there is any problem :)
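One extra note, since the question's edit says the result is written to an HTML file: a browser collapses "\n" into ordinary whitespace, so a visible line break needs a <br> tag. A minimal sketch of that variant (html_out is just an illustrative name):

rows = df['Date'].map(str) + ' | ' \
    + df['Time'].map(str) + ' | ' \
    + df['Description'].map(str)
# join rows with explicit HTML line breaks instead of relying on "\n"
html_out = '<br>\n'.join(rows)
print(html_out)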
So I need to parse this into a dataframe or list:
tmp = ['+--------------+-----------------------------------------+',
       '| Something to | Some header with subheader              |',
       '| watch or     +-----------------+-----------------------+',
       '| idk          | First           | another text again    |',
       '|              |                 | with one more line    |',
       '|              |                 +-----------------------+',
       '|              |                 | and this  | how it be |',
       '+--------------+-----------------+-----------------------+']
It is just a text table with a strange header. I need to transform it into this:
['Something to watch or idk', 'Some header with subheader First', 'Some header with subheader another text again with one more line and this', 'Some header with subheader another text again with one more line how it be']
Here's my first solution, which gets me closer to victory (you can see my tries in the comments):
import re

pluses = [i for i, element in enumerate(tmp) if element[0] == '+']
tmp2 = tmp[pluses[0]:pluses[1] + 1].copy()
table_str = ''.join(tmp[pluses[0]:pluses[1] + 1])
col = [[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
tmp3 = []
strt = ''.join(tmp2.copy())
table_list = [l.strip().replace('\n', '') for l in re.split(r'\+[+-]+', strt) if l.strip()]
for row in table_list:
    joined_row = ['' for _ in range(len(row))]
    for lines in [line for line in row.split('||')]:
        line_part = [i.strip() for i in lines.split('|') if i]
        joined_row = [i + j for i, j in zip(joined_row, line_part)]
    tmp3.append(joined_row)
Here's the output:

>>> tmp3
[['Something to', 'Some header with subheader'],
['Something towatch or'],
['idk', 'First', 'another text again'],
['idk', 'First', 'another text againwith one more line'],
['idk'],
['', '', 'and this', 'how it be']]
It only remains to join this in the right way, but I don't know how...
Here's an addendum: we can locate the pluses and splitters with this:
col=[[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
[[0, 15, 57],
[0, 15, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 45, 57],
[0, 15, 33, 57]]
And then we could split or group by cell, but I don't know how to do that either... Please help.
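A minimal sketch of that slicing idea, using the separator positions found above (line and cuts are illustrative names):

# cut one content line of tmp2 at the '+'/'|' positions of its row
line = tmp2[1]
cuts = [0, 15, 57]
cells = [line[a + 1:b].strip() for a, b in zip(cuts, cuts[1:])]
# cells == ['Something to', 'Some header with subheader']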
Example No.2:
+----------+------------------------------------------------------------+---------------+----------------------------------+--------------------+-----------------------+
| Number | longtextveryveryloooooong | aaaaaaaaaaa | bbbbbbbbbbbbbbbbbb | dfsdfgsdfddd |qqqqqqqqqqqqqqqqqqqqqq |
| string | | | ccccccccccccccccccccc | affasdd as |qqqqqqqqqqqqqqqqqqqqqq |
| | | | eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee,| seeerrrr e, | dfsdfffffffffffff |
| | | | anothertext and something | percent | ttttttttttttttttt |
| | | | (nothingtodo), | | sssssssssssssssssssss |
| | | | and text | |zzzzzzzzzzzzzzzzzzzzzz |
| | | +----------------------------------+ | b rererereerr ppppppp |
| | | | all | longtext wit- | | |
| | | | |h many character| | |
+----------+------------------------------------------------------------+---------------+-----------------+----------------+--------------------+-----------------------+
You could do it recursively, parsing one "sub table" at a time:
def parse_table(table, header='', root='', table_len=None):
    # store length of original table
    if not table_len:
        table_len = len(table)
    # end of current "column"
    col = table[0].find('+', 1)
    rows = [
        row for row in range(1, len(table))
        if table[row].startswith('+')
        and table[row][col] == '+'
    ]
    row = rows[0]
    # split "line" contents into columns
    # end of "line" is either `+` or final `|`
    end = col
    num_cols = table[0].count('+')
    if num_cols != table[1].count('|'):
        end = table[1].rfind('|')
    columns = (line[1:end].split('|') for line in table[1:row])
    # rebuild each column appending to header
    content = [
        ' '.join([header] + [line.strip() for line in lines]).strip()
        for lines in zip(*columns)
    ]
    # is there a table below?
    if row + 2 < len(table):
        header = content[-1]
        # if we are not the last table - we are a header
        if len(rows) > 1:
            header = content.pop()
        # if we are the first table in column - we are the root
        if not root:
            root = header
        next_table = [line[:col + 1] for line in table[row:]]
        content.extend(
            parse_table(
                next_table,
                header=header,
                root=root,
                table_len=table_len
            )
        )
    # is there a table to the right?
    if col + 2 < len(table[0]):
        # find start line of next table
        row = next(
            row for row, line in enumerate(table, start=-1)
            if line[col] == '|'
        )
        next_table = [line[col:] for line in table[row:]]
        # new top-level table - reset root
        if len(next_table) == table_len:
            root = ''
        # next table on same level - reset header
        if len(table) == len(next_table):
            header = root
        content.extend(
            parse_table(
                next_table,
                header=header,
                root=root,
                table_len=table_len
            )
        )
    return content
Output:
>>> parse_table(table)
['Something to watch or idk',
'Some header with subheader First',
'Some header with subheader another text again with one more line and this',
'Some header with subheader another text again with one more line how it be']
>>> parse_table(big_table)
['Number string',
'longtextveryveryloooooong',
'aaaaaaaaaaa',
'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text all',
'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text longtext wit- h many character',
'dfsdfgsdfddd affasdd as seeerrrr e, percent',
'qqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqq dfsdfffffffffffff ttttttttttttttttt sssssssssssssssssssss zzzzzzzzzzzzzzzzzzzzzz b rererereerr ppppppp']
>>> parse_table(planets)
['Planets Planet Sun (Solar) Earth Moon Mars',
'Planets R (km) 696000 6371 1737 3390',
'Planets mass (x 10^29 kg) 1989100000 5973.6 73.5 641.85']
As the input is in the format of a reStructuredText table, you could use the docutils table parser.
import docutils.parsers.rst.tableparser
import docutils.statemachine
from collections.abc import Iterable

def extract_texts(tds):
    "recursively extract StringLists and join"
    texts = []
    for e in tds:
        if isinstance(e, docutils.statemachine.StringList):
            texts.append(' '.join([s.strip() for s in list(e) if s]))
            break
        if isinstance(e, Iterable):
            texts.append(extract_texts(e))
    return texts
>>> parser = docutils.parsers.rst.tableparser.GridTableParser()
>>> tds = parser.parse(docutils.statemachine.StringList(tmp))
>>> extract_texts(tds)
[[],
[],
[[['Something to watch or idk'], ['Some header with subheader']],
[['First'], ['another text again with one more line']],
[['and this | how it be']]]]
Then flatten.
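A minimal sketch of that flattening step (flatten_texts is an illustrative helper, assuming the nested-list shape shown above):

def flatten_texts(nested):
    # recursively yield the leaf strings of a nested list
    for e in nested:
        if isinstance(e, list):
            yield from flatten_texts(e)
        else:
            yield e

texts = list(flatten_texts(extract_texts(tds)))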
For more general usage, it is worth taking a look at tds (the structure returned by parse); some documentation exists for it.
I'm reading in some JSON of the form:
{"a": [{"b": {"c": 1, "d": 2}}]}
That is, the array items are unnecessarily nested. Now, because this happens inside an array, the answers given in "How to flatten a struct in a Spark dataframe?" don't apply directly.
This is how the dataframe looks when parsed:
root
|-- a: array
| |-- element: struct
| | |-- b: struct
| | | |-- c: integer
| | | |-- d: integer
I'm looking to transform the dataframe into this:
root
|-- a: array
| |-- element: struct
| | |-- b_c: integer
| | |-- b_d: integer
How do I go about aliasing the columns inside the array to effectively unnest it?
You can use transform:
df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
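A quick way to sanity-check this (a sketch, assuming an active SparkSession named spark and Spark 2.4+, where the transform higher-order function is available):

df = spark.read.json(spark.sparkContext.parallelize(
    ['{"a": [{"b": {"c": 1, "d": 2}}]}']
))
df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
df2.printSchema()
# root (abridged)
#  |-- a: array
#  |    |-- element: struct
#  |    |    |-- b_c: long
#  |    |    |-- b_d: long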
Using the method presented in the accepted answer I wrote a function to recursively unnest a dataframe (recursing into nested arrays as well):
from pyspark.sql.types import ArrayType, StructType

def flatten(df, sentinel="x"):
    def _gen_flatten_expr(schema, indent, parents, last, transform=False):
        def handle(field, last):
            path = parents + (field.name,)
            alias = (
                " as "
                + "_".join(path[1:] if transform else path)
                + ("," if not last else "")
            )
            if isinstance(field.dataType, StructType):
                yield from _gen_flatten_expr(
                    field.dataType, indent, path, last, transform
                )
            elif (
                isinstance(field.dataType, ArrayType) and
                isinstance(field.dataType.elementType, StructType)
            ):
                yield indent, "transform("
                yield indent + 1, ".".join(path) + ","
                yield indent + 1, sentinel + " -> struct("
                yield from _gen_flatten_expr(
                    field.dataType.elementType,
                    indent + 2,
                    (sentinel,),
                    True,
                    True
                )
                yield indent + 1, ")"
                yield indent, ")" + alias
            else:
                yield (indent, ".".join(path) + alias)

        try:
            *fields, last_field = schema.fields
        except ValueError:
            pass
        else:
            for field in fields:
                yield from handle(field, False)
            yield from handle(last_field, last)

    lines = []
    for indent, line in _gen_flatten_expr(df.schema, 0, (), True):
        spaces = " " * 4 * indent
        lines.append(spaces + line)
    expr = "struct(" + "\n".join(lines) + ") as " + sentinel
    return df.selectExpr(expr).select(sentinel + ".*")
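A quick usage sketch (a hypothetical session, reusing the df parsed from the sample JSON earlier):

flattened = flatten(df)
flattened.printSchema()  # should show a: array<struct<b_c, b_d>>, matching the desired schema above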
Simplified Approach:
from pyspark.sql.functions import col

def flatten_df(nested_df):
    stack = [((), nested_df)]
    columns = []
    while len(stack) > 0:
        parents, df = stack.pop()
        flat_cols = [
            col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]
        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]
        columns.extend(flat_cols)
        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))
    return nested_df.select(columns)
ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/how-to-analyze-complex-schema
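One caveat worth noting: flatten_df only recurses into struct columns (it checks dtypes for the "struct" prefix), so arrays of structs like the one in this question are left untouched; those still need the transform-based approach above.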
I am making a parser to convert a simple DSL into an Elasticsearch query. Some of the possible queries are:
response:success
response:success AND extension:php OR extension:css
response:success AND (extension:php OR extension:css)
time >= 2020-01-09
time >= 2020-01-09 AND response:success OR os:windows
NOT response:success
response:success AND NOT os:windows
I have written the following EBNF grammar for this:
<expr> ::= <or>
<or> ::= <and> (" OR " <and>)*
<and> ::= <unary> ((" AND ") <unary>)*
<unary> ::= " NOT " <unary> | <equality>
<equality> ::= (<word> ":" <word>) | <comparison>
<comparison> ::= "(" <expr> ")" | (<word> (" > " | " >= " | " < " | " <= ") <word>)+
<word> ::= ("a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n"
| "o" | "p" | "q" | "r" | "s" | "t" | "u"
| "v" | "w" | "x" | "y" | "z")+
The precedence of operators in the DSL is:
() > NOT > AND > OR
Also, exact matching, i.e. ':', has higher precedence than the comparison operators.
I believe the above grammar captures the idea of my DSL. I am having a difficult time translating it to pyparsing; this is what I have now:
from pyparsing import *
AND = Keyword('AND') | Keyword('and')
OR = Keyword('OR') | Keyword('or')
NOT = Keyword('NOT') | Keyword('not')
word = Word(printables, excludeChars=':')
expr = Forward()
expr << Or
Comparison = Literal('(') + expr + Literal(')') + OneOrMore(word + ( Literal('>') | Literal('>=') | Literal('<') | Literal('<=')) + word)
Equality = (word + Literal(':') + word) | Comparison
Unary = Forward()
Unary << (NOT + Unary) | Equality
And = Unary + ZeroOrMore(AND + Unary)
Or = And + ZeroOrMore(OR + And)
The error I get is:
Traceback (most recent call last):
File "qql.py", line 54, in <module>
expr << Or
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyparsing.py", line 5006, in __lshift__
self.mayIndexError = self.expr.mayIndexError
AttributeError: type object 'Or' has no attribute 'mayIndexError'
I think it's because I am unable to understand Forward() correctly.
Question: how can I correctly translate the above grammar to pyparsing?
EDIT: when I changed the pyparsing code to:
AND = Keyword('AND')
OR = Keyword('OR')
NOT = Keyword('NOT')
word = Word(printables, excludeChars=':')
expr = Forward()
Comparison = Literal('(') + expr + Literal(')') + OneOrMore(word + ( Literal('>') | Literal('>=') | Literal('<') | Literal('<=')) + word)
Equality = (word + Literal(':') + word) | Comparison
Unary = Forward()
Unary << ((NOT + Unary) | Equality)
And = Unary + ZeroOrMore(AND) + Unary
Or = And + ZeroOrMore(OR + And)
expr << Or
Q = """response : 200 \
AND extesnion: php \
OR extension: css \
"""
print(expr.parseString(Q))
I get this output:
['response', ':', '200', 'AND', 'extesnion', ':', 'php']
Why is the OR expression not parsed?
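A likely explanation, based only on the code above: in the edited version, And = Unary + ZeroOrMore(AND) + Unary requires a second Unary after the optional AND keywords. When the parser reaches "OR extension: css", the inner And finds only one operand before the end of input, so OR + And fails and parseString() stops quietly after 'php' (it does not require consuming the whole string unless parseAll=True is passed). Grouping the keyword with its operand, exactly as the EBNF rule <and> ::= <unary> (" AND " <unary>)* does, should fix it:

And = Unary + ZeroOrMore(AND + Unary)  # keep each AND paired with its right-hand operand
Or = And + ZeroOrMore(OR + And)
expr << Or

The earlier AttributeError has a related cause: at the point where expr << Or first ran, the local Or was not yet defined, so the star import's pyparsing Or class was picked up instead. For precedence towers like this one, pyparsing also provides the infixNotation helper, which builds the whole cascade (including parenthesized grouping) from a precedence list.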
I want to count the occurrences of every 'size' field in my data:
counts = (
    lines
    | 'convert_to_dict' >> beam.Map(process_data)
    | 'Window' >> beam.WindowInto(window.FixedWindows(1 * 1),
                                  accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | 'GettinyCounty' >> beam.Map(lambda x: x['size'])
    | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
    | 'Sum' >> beam.CombinePerKey(sum))
Printing the results to the console:
def format_result(size_count):
    (size, count) = size_count
    out = {}
    out["size"] = size
    out["count"] = count
    print("{}".format(out))
    return str(out)

output = (counts
    | 'format_result' >> beam.Map(format_result)
    | 'encode' >> beam.Map(lambda x: x.encode('utf-8')).with_output_types(bytes))

output | beam.io.WriteToPubSub(known_args.output_topic)
Here's what I get. I expected something like:
{'medium': 155}, {'small': 75}, ...