Apache beam CombinePerKey(sum) is not summing correctly

Apache beam CombinePerKey(sum) is not summing correctly - python

I want to count the occurrence of every field 'size' I have in my data:
counts = (
lines
| 'convert_to_dict' >> beam.Map(process_data)
| 'Window' >> beam.WindowInto(window.FixedWindows(1*1),accumulation_mode=trigger.AccumulationMode.DISCARDING)
| 'GettinyCounty' >> beam.Map(lambda x: x['size'])
| 'PairWithOne' >> beam.Map(lambda x: (x, 1))
| 'Sum' >> beam.CombinePerKey(sum))
Printing the results on the console
def format_result(size_count):
(size, count) = size_count
out = {}
out["size"] = size
out["count"] = count
print("{}".format(out))
return str(out)
output = counts | 'format_result' >> beam.Map(format_result)
| 'encode' >> beam.Map(lambda x: x.encode('utf-8')).with_output_types(bytes)))
output | beam.io.WriteToPubSub(known_args.output_topic)
Here's what I get. I expected something like:
{'medium' : 155} , {'small' : 75},...}

Related

Adding a newline to the printed data from an imported file

I made the following code, which imports a file and prints its content :
import pandas as pd
file = r"..\test.xlsx"
try:
df = pd.read_excel(file)
#print(df)
except OSError:
print("Impossible to read", file)
test =
df['Date'].map(str) + ' | ' \
+ df['Time'].map(str) + ' | ' \
+ df['Description'].map(str) + ' | ' \
+ '\n'
print(test)
The output is (Edit : I precise that it is printed in an html file) :
20/01 | 17:00 | Text description here1 17/01 | 11:00 | Text
description here2 16/01 | 16:32 | Text description here3 <- In orange
when the the "Urgence" is equal to 3
But what I want is :
20/01 | 17:00 | Text description here1
17/01 | 11:00 | Text description here2
16/01 | 16:32 | Text description here3
I added a new line at the end of my statement + '\n' but it doesn't seem to change anything. How should I proceed ? Thank you.
Edit : I believe that the problem comes from the fact that the entire file is printed, and not line by line so it doesn't add the newline to each line. So I made this code :
test = []
for index, row in df.iterrows():
x = row['Date'] + ' | ' + row['Description'] + '\n'
test.append(x)
print(test)
But the result is the same..

Try this:
test = df['Date'].map(str) + ' | ' +
df['Time'].map(str) + ' | ' +
df['Description'].map(str) + ' | '
list(map(lambda x: print(x), test))
I removed the end of the test string and added the print function.
Let me know if there is any problem :)

How to align strings in columns?

I am trying to print out a custom format but am facing an issue.
header = ['string', 'longer string', 'str']
header1, header2, header3 = header
data = ['string', 'str', 'longest string']
data1, data2, data3 = data
len1 = len(header1)
len2 = len(header2)
len3 = len(header3)
len_1 = len(data1)
len_2 = len(data2)
len_3 = len(data3)
un = len1 + len2 + len3 + len_1 + len_2 + len_3
un_c = '_' * un
print(f"{un_c}\n|{header1} |{header2} |{header3}| \n |{data1} |{data2} |{data3}|")
Output:
_____________________________________________
|string |longer string |str|
|string |str |longest string|
The output I want is this:
_______________________________________
|string |longer string |str |
|string |str |longest string|
I want it to work for all lengths of strings using the len to add extra spacing to each string to make it aligned, but I can't figure it out at all.

There is a package called tabulate this is very good for this (https://pypi.org/project/tabulate/). Similar post here.

Each cell is constructed according to the longest content, with additional spaces for any shortfall, printing a | at the beginning of each line, and the rest of the | is constructed using the end parameter of print
The content is placed in a nested list to facilitate looping, other ways of doing this are possible, the principle is the same and adding some content does not affect it
items = [
['string', 'longer string', 'str'],
['string', 'str', 'longest string'],
['longer string', 'str', 'longest string'],
]
length = [max([len(item[i]) for item in items]) for i in range(len(items[0]))]
max_length = sum(length)
print("_" * (max_length + 4))
for item in items:
print("|", end="")
for i in range(len(length)):
item_length = len(item[i])
if length[i] > len(item[i]):
print(item[i] + " " * (length[i] - item_length), end="|")
else:
print(item[i], end="|")
print()
OUTPUT:
____________________________________________
|string |longer string|str |
|string |str |longest string|
|longer string|str |longest string|

Do it in two parts. First, figure out the size of each column. Then, do the printing based on those sizes.
header = ['string','longer string','str']
data = ['string','str','longest string']
lines = [header] * 3 + [data] * 3
def getsizes(lines):
maxn = [0] * len(lines[0])
for row in lines:
for i,col in enumerate(row):
maxn[i] = max(maxn[i], len(col)+1)
return maxn
def maketable(lines):
sizes = getsizes(lines)
all = sum(sizes)
print('_'*(all+len(sizes)) )
for row in lines:
print('|',end='')
for width, col in zip( sizes, row ):
print( col.ljust(width), end='|' )
print()
maketable(lines)
Output:
_______________________________________
|string |longer string |str |
|string |longer string |str |
|string |longer string |str |
|string |str |longest string |
|string |str |longest string |
|string |str |longest string |
You could change it to build up a single string, if you need that.

It accept an arbitrary number of rows. Supposed each row has string-type terms.
def table(*rows, padding=2, sep='|'):
sep_middle = ' '*(padding//2) + sep + ' '*(padding//2)
template = '{{:{}}}'
col_sizes = [max(map(len, col)) for col in zip(*rows)]
table_template = sep_middle.join(map(template.format, col_sizes))
print('_' * (sum(col_sizes) + len(sep_middle)*(len(header)-1) + 2*len(sep) + 2*(len(sep)*padding//2)))
for line in (header, *rows):
print(sep + ' ' * (padding//2) + table_template.format(*line) + ' ' * (padding//2) + sep)
header = ['string', 'longer string', 'str', '21']
data1 = ['string', 'str', 'longest stringhfykhj', 'null']
data2 = ['this', 'is', 'a', 'test']
# test 1
table(header, data1, data2)
# test 2
table(header, data1, data2, padding=4, sep=':')
Output
# first test
________________________________________________________
| string | longer string | str | 21 |
| string | longer string | str | 21 |
| string | str | longest stringhfykhj | null |
| this | is | a | test |
# second test
________________________________________________________________
: string : longer string : str : 21 :
: string : longer string : str : 21 :
: string : str : longest stringhfykhj : null :
: this : is : a : test :

Parse ascii table header

So I need to parse this into dataframe or list:
tmp =
['+--------------+-----------------------------------------+',
'| Something to | Some header with subheader |',
'| watch or +-----------------+-----------------------+',
'| idk | First | another text again |',
'| | | with one more line |',
'| | +-----------------------+',
'| | | and this | how it be |',
'+--------------+-----------------+-----------------------+']
It is just txt table with strange header. I need to transform it to this:
['Something to watch or idk', 'Some header with subheader First', 'Some header with subheader another text again with one more line and this', 'Some header with subheader another text again with one more line how it be']
Here's my first solution that make me closer to victory (you can see the comments my tries):
pluses = [i for i, element in enumerate(tmp) if element[0] == '+']
tmp2 = tmp[pluses[0]:pluses[1]+1].copy()
table_str=''.join(tmp[pluses[0]:pluses[1]+1])
col=[[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
tmp3=[]
strt = ''.join(tmp2.copy())
table_list = [l.strip().replace('\n', '') for l in re.split(r'\+[+-]+', strt) if l.strip()]
for row in table_list:
joined_row = ['' for _ in range(len(row))]
for lines in [line for line in row.split('||')]:
line_part = [i.strip() for i in lines.split('|') if i]
joined_row = [i + j for i, j in zip(joined_row, line_part)]
tmp3.append(joined_row)
here's out:
tmp3
out[4]:
[['Something to', 'Some header with subheader'],
['Something towatch or'],
['idk', 'First', 'another text again'],
['idk', 'First', 'another text againwith one more line'],
['idk'],
['', '', 'and this', 'how it be']]
Remains only join this in the right way but idk how to...
Here's addon:
We can locate pluses and splitters by this:
col=[[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
[[0, 15, 57],
[0, 15, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 57],
[0, 15, 33, 45, 57],
[0, 15, 33, 57]]
And then we can split or group by cell but idk how to too... Please help
Example No.2:
+----------+------------------------------------------------------------+---------------+----------------------------------+--------------------+-----------------------+
| Number | longtextveryveryloooooong | aaaaaaaaaaa | bbbbbbbbbbbbbbbbbb | dfsdfgsdfddd |qqqqqqqqqqqqqqqqqqqqqq |
| string | | | ccccccccccccccccccccc | affasdd as |qqqqqqqqqqqqqqqqqqqqqq |
| | | | eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee,| seeerrrr e, | dfsdfffffffffffff |
| | | | anothertext and something | percent | ttttttttttttttttt |
| | | | (nothingtodo), | | sssssssssssssssssssss |
| | | | and text | |zzzzzzzzzzzzzzzzzzzzzz |
| | | +----------------------------------+ | b rererereerr ppppppp |
| | | | all | longtext wit- | | |
| | | | |h many character| | |
+----------+------------------------------------------------------------+---------------+-----------------+----------------+--------------------+-----------------------+

You could do it recursively - parsing each "sub table" at a time:
def parse_table(table, header='', root='', table_len=None):
# store length of original table
if not table_len:
table_len = len(table)
# end of current "column"
col = table[0].find('+', 1)
rows = [
row for row in range(1, len(table))
if table[row].startswith('+')
and table[row][col] == '+'
]
row = rows[0]
# split "line" contents into columns
# end of "line" is either `+` or final `|`
end = col
num_cols = table[0].count('+')
if num_cols != table[1].count('|'):
end = table[1].rfind('|')
columns = (line[1:end].split('|') for line in table[1:row])
# rebuild each column appending to header
content = [
' '.join([header] + [line.strip() for line in lines]).strip()
for lines in zip(*columns)
]
# is there a table below?
if row + 2 < len(table):
header = content[-1]
# if we are not the last table - we are a header
if len(rows) > 1:
header = content.pop()
# if we are the first table in column - we are the root
if not root:
root = header
next_table = [line[:col + 1] for line in table[row:]]
content.extend(
parse_table(
next_table,
header=header,
root=root,
table_len=table_len
)
)
# is there a table to the right?
if col + 2 < len(table[0]):
# find start line of next table
row = next(
row for row, line in enumerate(table, start=-1)
if line[col] == '|'
)
next_table = [line[col:] for line in table[row:]]
# new top-level table - reset root
if len(next_table) == table_len:
root = ''
# next table on same level - reset header
if len(table) == len(next_table):
header = root
content.extend(
parse_table(
next_table,
header=header,
root=root,
table_len=table_len
)
)
return content
Output:
>>> parse_table(table)
['Something to watch or idk',
'Some header with subheader First',
'Some header with subheader another text again with one more line and this',
'Some header with subheader another text again with one more line how it be']
>>> parse_table(big_table)
['Number string',
'longtextveryveryloooooong',
'aaaaaaaaaaa',
'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text all',
'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text longtext wit- h many character',
'dfsdfgsdfddd affasdd as seeerrrr e, percent',
'qqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqq dfsdfffffffffffff ttttttttttttttttt sssssssssssssssssssss zzzzzzzzzzzzzzzzzzzzzz b rererereerr ppppppp']
>>> parse_table(planets)
['Planets Planet Sun (Solar) Earth Moon Mars',
'Planets R (km) 696000 6371 1737 3390',
'Planets mass (x 10^29 kg) 1989100000 5973.6 73.5 641.85']

As the input is in the format of a reStructuredText table, you could use the docutils table parser.
import docutils.parsers.rst.tableparser
from collections.abc import Iterable
def extract_texts(tds):
" recursively extract StringLists and join"
texts = []
for e in tds:
if isinstance(e, docutils.statemachine.StringList):
texts.append(' '.join([s.strip() for s in list(e) if s]))
break
if isinstance(e, Iterable):
texts.append(extract_texts(e))
return texts
>>> parser = docutils.parsers.rst.tableparser.GridTableParser()
>>> tds = parser.parse(docutils.statemachine.StringList(tmp))
>>> extract_texts(tds)
[[],
[],
[[['Something to watch or idk'], ['Some header with subheader']],
[['First'], ['another text again with one more line']],
[['and this | how it be']]]]
then flatten.
For a more general usage, it is interesting to give a look in tds (the structure returned by parse): some documentation there

python maze using recursion

I want to solve a maze using recursion. The program opens a text file like this one:
10 20
1 1
10 20
-----------------------------------------
| | | | | | | | | |
|-+ +-+-+ +-+ + +-+ + + +-+-+ +-+-+ + + |
| | | | | | | | | |
| + +-+ + + +-+-+-+ + + + + + +-+ + + + |
| | | | | | | | | | |
| +-+-+-+-+ +-+ +-+-+-+-+ +-+ + +-+-+ +-|
| | | | | | | | | | |
| + + +-+ +-+-+ + + + +-+ +-+ + + + +-+ |
| | | | | | | | | | | |
|-+-+ +-+-+-+-+-+-+-+ +-+ +-+-+ +-+-+ +-|
| | | | | | | | | | | |
| +-+-+ +-+-+ +-+ + +-+-+ +-+ +-+ + + + |
| | | | | | | | | | |
|-+ +-+ + + +-+ +-+-+ + +-+ + + +-+ +-+ |
| | | | | | | | | | | | | | | | |
|-+ + +-+ + + + + + +-+ + + + + +-+-+ + |
| | | | | | | |
| + + +-+ + +-+-+-+ + +-+ + + +-+-+ +-+ |
| | | | | | | | | | |
-----------------------------------------
The first line of the file is the size of the maze(10 20), the second line is the starting point(1 1), and the third line is the exit(10, 20). I want to mark the correct path with "*". This is what my code looks like:
EDIT: I change some of the code in the findpath() funtion, and now I dont get any errors but the maze is empty, the path('*') is not 'drawn' on the maze.
class maze():
def __init__(self, file):
self.maze_list = []
data= file.readlines()
size = data.pop(0).split() # size of maze
start = data.pop(0).split() # starting row and column; keeps being 0 because the previous not there anymore
self.start_i = start[0] # starting row
self.start_j = start[1] # starting column
exit = data.pop(0).split() # exit row and column
self.end_i = exit[0]
self.end_j = exit[1]
for i in data:
self.maze_list.append(list(i[:-1])) # removes \n character off of each list of list
print(self.maze_list) # debug
def print_maze(self):
for i in range(len(self.maze_list)):
for j in range(len(self.maze_list[0])):
print(self.maze_list[i][j], end='')
print()
def main():
filename = input('Please enter a file name to be processed: ') # prompt user for a file
try:
file = open(filename, 'r')
except: # if a non-existing file is atempted to open, returns error
print("File Does Not Exist")
return
mazeObject = maze(file)
mazeObject.print_maze() # prints maze
row = int(mazeObject.start_i)
col = int(mazeObject.start_j)
findpath(row, col, mazeObject) # finds solution route of maze if any
def findpath(row, col, mazeObject):
if row == mazeObject.end_i and col == mazeObject.end_j: # returns True if it has found the Exit of the maze
mazeObject.maze_list[row][col] = ' ' # to delete wrong path
return True
elif mazeObject.maze_list[row][col] == "|": # returns False if it finds wall
return False
elif mazeObject.maze_list[row][col] '-': # returns False if it finds a wall
return False
elif mazeObject.maze_list[row][col] '+': # returns False if it finds a wall
return False
elif mazeObject.maze_list[row][col] '*': # returns False if the path has been visited
return False
mazeObject.maze_list[row][col] = '*' # marks the path followed with an *
if ((findpath(row + 1, col, mazeObject))
or (findpath(row, col - 1, mazeObject))
or (findpath(row - 1, col, mazeObject))
or (findpath(row, col + 1, mazeObject))): # recursion method
mazeObject.maze_list[row][col] = ' ' # to delete wrong path
return True
return False
So now my question is, where is the error? I mean the program just prints out the maze without the solution. I want to fill the correct path with "*".

Looking at your code I see several errors. You do not handle the entry and exit row/column pairs correctly. (10,20) is correct for this maze, if you assume that every other row, and every other column, is a grid line. That is, if the | and - characters represent infinitely thin lines that have occasional breaks in them, much like traditional maze drawings.
You'll need to multiple/divide by two, and deal with the inevitable fencepost errors, in order to correctly translate your file parameters into array row/column values.
Next, your findpath function is confused:
First, it should be a method of the class. It accesses internal data, and contains "inner knowledge" about the class details. Make it a method!
Second, your exit condition replaces the current character with a space "to delete wrong path". But if you have found the exit, the path is by definition correct. Don't do that!
Third, you have a bunch of if statements for various character types. That is fine, but please replace them with a single if statement using
if self.maze_list[row][col] in '|-+*':
return False
Fourth, you wait to mark the current cell with a '*' until after your checks. But you should mark the cell before you declare victory when you reach the exit location. Move the exit test down, I think.
That should clean things up nicely.
Fifth, and finally, your recursive test is backwards. Your code returns True when it reached the exit location. Your code returns False when it runs into a wall, or tries to cross its own path. Therefore, if the code takes a dead end path, it will reach the end, return false, unroll the recursion a few times, returning false all along, until it gets back.
Thus, if you EVER see a True return, you know the code found the exit down that path. You want to immediately return true and do nothing else. Certainly don't erase the path - it leads to the exit!
On the other hand, if none of your possible directions return true, then you have found a dead end - the exit does not lie in this direction. You should erase your path, return False, and hope that the exit can be found at a higher level.

Append value only when not found

If a taxonomy in taxonomies is not in translations. I want it to print 152W00000X | Not Found currently all of the lines print with Not Found. if I remove the else I get an out of range error.
taxonomies = ['152W00000X', '156FX1800X', '200000000X', '261QD0000X', '3336C0003X', '333600000X', '261QD0000X']
translations = {'261QD0000X': 'Clinic/Center Dental', '3336C0003X': 'Pharmacy Community/Retail Pharmacy', '333600000X': 'Pharmacy'}
a = 0
final = []
for nums in taxonomies:
for i, v in translations.items():
if nums == i:
data = v
final.append(data)
else:
final.append('Not Found')
for nums in taxonomies:
print nums, "|", final[a]
a = a + 1
Current output is:
152W00000X | Not Found
156FX1800X | Not Found
200000000X | Not Found
261QD0000X | Not Found
3336C0003X | Not Found
333600000X | Not Found
261QD0000X | Not Found
The ideal output is:
152W00000X | Not Found
156FX1800X | Not Found
200000000X | Not Found
261QD0000X | Clinic/Center Dental
3336C0003X | Pharmacy Community/Retail Pharmacy
333600000X | Pharmacy
261QD0000X | Clinic/Center Dental
taxonomies = ['152W00000X', '156FX1800X', '200000000X', '261QD0000X', '3336C0003X', '333600000X', '261QD0000X']
translations = {'261QD0000X': 'Clinic/Center Dental', '3336C0003X': 'Pharmacy Community/Retail Pharmacy', '333600000X': 'Pharmacy'}
a = 0
final = []
for nums in taxonomies:
final.append(translations.get(nums, 'Not Found'))
for nums in taxonomies:
print nums, "|", final[a]
a = a + 1

I am using re to split IDVtaxo.txt at two or more spaces. Unless the source is actually delimited by tabs then this will work.
import re
with open('IDVtaxo.txt') as f:
idvtaxo = {re.split(r'\s{2,}', x)[0]: re.split(r'\s{2,}', x)[2] for x in f.read().splitlines()}
with open('taxonomies.txt') as f:
taxonomies = f.read().splitlines()
for taxonomy in taxonomies:
data = taxonomy.split('|')
tranlated = idvtaxo.get(data[1], 'Not Found')
print '%s|%s' % (taxonomy, tranlated)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Apache beam CombinePerKey(sum) is not summing correctly - python

Related

Adding a newline to the printed data from an imported file

How to align strings in columns?

Parse ascii table header

python maze using recursion

Append value only when not found

Categories

Resources