Invalid character "\u64e" in token Pylance - python

What does the Pylance error Invalid character "\u64e" in token mean (shown with a red error squiggle under Acc) for this code, and how can I fix it?
err = calculateCError()
print('Error is:', err, '%')
َAcc = 100 - err
print('َAccuracy is:',َAcc , '%')

Here's how to debug something like this:
s = """err = calculateCError()
print('Error is:', err, '%')
َAcc = 100 - err
print('َAccuracy is:',َAcc , '%')"""
print([hex(ord(c)) for c in s])
['0x65', '0x72', '0x72', '0x20', '0x3d', '0x20', '0x63', '0x61', '0x6c',
'0x63', '0x75', '0x6c', '0x61', '0x74', '0x65', '0x43', '0x45', '0x72',
'0x72', '0x6f', '0x72', '0x28', '0x29', '0xa', '0x70', '0x72', '0x69',
'0x6e', '0x74', '0x28', '0x27', '0x45', '0x72', '0x72', '0x6f', '0x72',
'0x20', '0x69', '0x73', '0x3a', '0x27', '0x2c', '0x20', '0x65', '0x72',
'0x72', '0x2c', '0x20', '0x27', '0x25', '0x27', '0x29', '0xa', '0x64e',
'0x41', '0x63', '0x63', '0x20', '0x3d', '0x20', '0x31', '0x30', '0x30',
'0x20', '0x2d', '0x20', '0x65', '0x72', '0x72', '0xa', '0x70', '0x72',
'0x69', '0x6e', '0x74', '0x28', '0x27', '0x64e', '0x41', '0x63', '0x63',
'0x75', '0x72', '0x61', '0x63', '0x79', '0x20', '0x69', '0x73', '0x3a',
'0x27', '0x2c', '0x64e', '0x41', '0x63', '0x63', '0x20', '0x2c', '0x20',
'0x27', '0x25', '0x27', '0x29']
And sure enough, there are three instances of 0x64E, always appearing before 0x41 (A). In fact, if you look carefully at your A characters, you may notice a faint slanted accent line above the A: this is the Arabic Fatha (U+064E), a combining mark, and at high zoom it becomes obvious.
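The fix is simply to delete the stray mark and retype the identifier as a plain Acc. If the character is hard to locate in an editor, here is a minimal sketch that strips combining marks such as U+064E from a source file (the file name script.py is hypothetical; unicodedata.combining returns nonzero exactly for combining characters):

import unicodedata

def strip_combining(text):
    # drop combining marks (e.g. the Arabic Fatha, U+064E) hiding in identifiers
    return ''.join(c for c in text if not unicodedata.combining(c))

source = open('script.py', encoding='utf-8').read()   # hypothetical file name
open('script.py', 'w', encoding='utf-8').write(strip_combining(source))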


Sudoku checker issues with Python

I'm trying to create a sudoku checker in Python. I found a version here in another thread, but it does not work properly, and I wonder what the issue is.
I receive the following error:
Traceback (most recent call last):
File "C:\Users\Omistaja\Downloads\sudoku_checker_template.py", line 72, in <module>
main()
File "C:\Users\Omistaja\Downloads\sudoku_checker_template.py", line 63, in main
is_valid = check_sudoku_grid(grid)
File "C:\Users\Omistaja\Downloads\sudoku_checker_template.py", line 20, in check_sudoku_grid
if grid[row][col] < 1 or type(grid[row][col]) is not type(1):
TypeError: '<' not supported between instances of 'NoneType' and 'int'
Anyway, below is the whole thing. Only the check_sudoku_grid should be modified, the rest should work. Thanks for your help!
from grids import GRID_NAMES, GRID_RETURNS, GRIDS, GRIDS_BIG, GRIDS_SMALL

GRID_SIZE = 9  # Length of one side of the sudoku
SUBGRID_SIZE = 3  # Length of one side of a cell of the sudoku

def check_sudoku_grid(grid):
    """
    Parameter : GRID_SIZE * GRID_SIZE two-dimensional list
    Return value : Boolean (True/False)
    Checks whether a sudoku grid is valid,
    i.e. doesn't contain any duplicates (besides None)
    in any row, column or cell.
    """
    for row in range(len(grid)):
        for col in range(len(grid)):
            # check value is an int
            if grid[row][col] < 1 or type(grid[row][col]) is not type(1):
                return False
            # check value is within 1 through n.
            # for example a 2x2 grid should not have the value 8 in it
            elif grid[row][col] > len(grid):
                return False
    # check the rows
    for row in grid:
        if sorted(list(set(row))) != sorted(row):
            return False
    # check the cols
    cols = []
    for col in range(len(grid)):
        for row in grid:
            cols += [row[col]]
        # set will get unique values, it's converted to list so you can compare;
        # it's sorted so the comparison is done correctly.
        if sorted(list(set(cols))) != sorted(cols):
            return False
        cols = []
    # if you get past all the false checks return True
    return True

def print_grid(grid):
    for i in range(GRID_SIZE):
        row = ""
        for j in range(GRID_SIZE):
            try:
                val = int(grid[i][j])
            except TypeError:
                val = "_"
            except ValueError:
                val = grid[i][j]
            row += "{} ".format(val)
            if j % SUBGRID_SIZE == SUBGRID_SIZE - 1:
                row += " "
        print(row)
        if i % SUBGRID_SIZE == SUBGRID_SIZE - 1:
            print()

def main():
    i = 0
    for grid in GRIDS:
        is_valid = check_sudoku_grid(grid)
        print("This grid {:s}.".format(GRID_NAMES[i]))
        print("Your function should return: {:s}".format(GRID_RETURNS[i]))
        print("Your function returns: {}".format(is_valid))
        print_grid(grid)
        i += 1

main()
The grids module (grids.py) referenced by the import looks like this:

GRID_NAMES = ["is valid", "is valid containing None values", "is valid containing None values (2)", \
"has an invalid row", "has an invalid column", "has an invalid subgrid"]
GRID_RETURNS = ["True","True","True","False","False","False"]
n = None
a = 'a'
b = 'b'
c = 'c'
d = 'd'
e = 'e'
f = 'f'
g = 'g'
GRID_VALID = [[7,3,5, 6,1,4, 8,9,2],
[8,4,2, 9,7,3, 5,6,1],
[9,6,1, 2,8,5, 3,7,4],
[2,8,6, 3,4,9, 1,5,7],
[4,1,3, 8,5,7, 9,2,6],
[5,7,9, 1,2,6, 4,3,8],
[1,5,7, 4,9,2, 6,8,3],
[6,9,4, 7,3,8, 2,1,5],
[3,2,8, 5,6,1, 7,4,9]
]
GRID_VALID_NONE = [[7,3,5, 6,1,4, 8,9,2],
[8,4,2, 9,7,3, 5,6,1],
[9,6,1, 2,8,5, 3,7,4],
[2,n,n, 3,4,n, 1,5,7],
[4,1,3, 8,5,7, 9,2,6],
[5,7,9, 1,2,6, 4,3,8],
[1,5,7, 4,9,n, 6,8,3],
[6,9,4, 7,3,8, 2,1,5],
[n,2,8, 5,6,1, 7,4,9]
]
GRID_VALID_NONE_2 = [[7,3,5, 6,1,4, n,9,2],
[8,4,2, 9,7,3, 5,6,1],
[n,n,1, 2,8,5, 3,7,4],
[2,n,n, 3,4,n, 1,5,7],
[4,1,3, 8,5,7, 9,2,6],
[5,n,9, 1,2,6, 4,3,8],
[1,5,7, 4,9,n, n,8,3],
[6,9,4, 7,3,8, 2,1,5],
[n,2,8, 5,6,1, 7,4,n]
]
GRID_INVALID_SUBGRID = [[7,3,5, 6,1,4, 8,9,2],
[8,4,2, 9,7,3, 5,6,1],
[9,6,1, 2,8,5, 3,7,4],
[2,8,6, 3,4,9, 1,5,7],
[4,1,3, n,5,7, 9,2,6],
[5,7,9, 1,2,6, 4,3,8],
[1,5,7, 4,9,2, 6,8,3],
[6,9,4, 7,3,8, 2,1,5],
[3,2,n, 8,6,1, 7,4,9]
]
GRID_INVALID_ROW = [[7,3,5, 6,1,4, 8,9,2],
[8,4,2, 9,7,3, 5,6,1],
[9,6,1, 2,8,5, 3,7,4],
[2,8,6, 3,4,9, 1,5,7],
[4,1,3, 8,5,7, 9,2,6],
[5,7,9, 1,2,6, 4,3,8],
[1,5,7, 4,9,2, 6,8,n],
[6,9,4, 7,3,8, 2,1,3],
[3,2,8, 5,6,1, 7,4,9]
]
GRID_INVALID_COLUMN = [[7,3,5, 6,1,4, 8,9,2],
[8,4,2, 9,7,3, 5,6,1],
[9,6,1, 2,8,5, 3,7,4],
[2,8,6, 3,4,9, 1,5,7],
[4,1,3, 8,5,7, 9,2,6],
[5,7,9, 1,2,6, 4,3,8],
[1,5,n, 4,9,2, 6,8,3],
[6,9,4, 7,3,8, 2,1,5],
[7,2,8, 5,6,1, n,4,9]
]
GRIDS = [GRID_VALID, GRID_VALID_NONE, GRID_VALID_NONE_2, \
GRID_INVALID_ROW, GRID_INVALID_COLUMN, GRID_INVALID_SUBGRID]
GRID_SMALL_VALID = [[1,2, 3,4],
[3,4, 1,2],
[2,3, 4,1],
[4,1, 2,3]]
GRID_SMALL_VALID_NONE = [[1,n, 3,4],
[3,4, n,n],
[2,n, 4,1],
[4,1, n,3]]
GRID_SMALL_VALID_NONE_2 = [[1,n, 3,4],
[n,n, n,2],
[2,n, 4,1],
[4,n, 2,3]]
GRID_SMALL_INVALID_ROW = [[1,2, 3,n],
[2,3, 4,4],
[3,4, 1,2],
[4,1, 2,3]]
GRID_SMALL_INVALID_COLUMN = [[1,2, 3,4],
[2,3, 4,1],
[3,4, n,1],
[4,1, 2,3]]
GRID_SMALL_INVALID_SUBGRID = [[1,2, 3,4],
[2,3, 4,1],
[3,4, 1,2],
[4,1, 2,3]]
GRIDS_SMALL = [GRID_SMALL_VALID, GRID_SMALL_VALID_NONE, GRID_SMALL_VALID_NONE_2, \
GRID_SMALL_INVALID_ROW, GRID_SMALL_INVALID_COLUMN, GRID_SMALL_INVALID_SUBGRID]
GRID_BIG_VALID = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b],
[2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e],
[e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7],
[b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4],
[8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6],
[1,e,6,d, c,8,4,5, a,g,9,b, 2,3,7,f],
[a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d],
[c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5],
[9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8],
[5,8,e,g, 7,9,1,6, 3,4,d,a, b,c,f,2],
[7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a],
[d,c,2,b, a,5,8,f, 7,1,g,e, 3,6,4,9],
[f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g],
[6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3],
[g,b,4,7, 8,d,a,e, c,6,3,5, f,9,2,1],
[3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c]
]
GRID_BIG_VALID_NONE = [[4,a,9,n, 1,7,d,8, 6,e,n,c, g,5,3,n],
[n,5,3,1, f,n,b,g, d,9,8,7, 6,a,c,e],
[e,6,d,c, 3,a,5,2, g,b,n,4, n,f,n,7],
[b,7,n,8, n,e,9,c, n,3,a,f, 1,2,d,4],
[8,g,b,4, d,f,n,9, 2,5,7,n, c,1,a,6],
[1,e,n,d, c,n,4,5, a,g,n,b, 2,3,7,f],
[a,f,n,3, 2,1,n,7, n,n,e,8, 9,b,g,n],
[c,2,7,9, b,3,g,a, f,d,6,1, 4,n,n,5],
[9,4,1,a, e,n,3,d, b,f,c,6, 7,g,5,8],
[5,n,e,g, 7,9,n,6, 3,4,d,a, b,n,f,2],
[7,3,f,6, g,b,c,4, n,n,5,9, e,d,n,a],
[n,n,n,b, a,5,8,f, 7,1,n,e, 3,6,4,9],
[f,9,8,2, 4,c,7,3, n,n,b,d, 5,e,6,g],
[6,n,c,5, 9,n,f,1, e,n,4,2, a,7,n,3],
[g,b,4,7, 8,d,a,e, c,6,n,5, f,9,2,n],
[3,1,n,n, n,6,2,b, 9,7,f,g, d,4,8,c]
]
GRID_BIG_VALID_NONE_2 = [[4,a,9,f, 1,7,d,n, 6,e,n,n, g,n,3,b],
[2,5,3,1, f,4,b,g, d,n,8,7, 6,a,n,e],
[e,6,d,c, 3,a,n,2, g,b,1,4, 8,f,9,7],
[b,7,g,n, n,e,9,c, 5,3,a,n, n,2,d,4],
[8,g,b,4, d,f,e,n, 2,5,7,3, c,1,a,6],
[1,n,6,d, n,n,4,n, a,g,n,b, 2,3,7,f],
[a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d],
[c,2,7,n, b,3,g,a, f,d,6,1, 4,8,e,5],
[9,4,1,a, e,2,n,n, b,f,c,n, 7,g,5,8],
[5,n,e,g, 7,9,n,6, 3,4,d,a, b,c,f,2],
[7,3,f,6, g,b,c,4, 8,n,n,n, e,d,1,a],
[d,c,2,n, a,n,8,f, 7,1,g,n, 3,6,n,9],
[f,n,8,2, 4,c,7,3, 1,a,b,d, n,e,6,n],
[6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3],
[g,b,n,7, 8,d,a,e, n,6,n,5, f,n,2,n],
[n,1,a,e, n,6,2,b, 9,n,f,g, d,n,8,c]
]
GRID_BIG_INVALID_ROW = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b],
[2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e],
[e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7],
[b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4],
[8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6],
[1,e,6,d, c,n,4,5, a,g,9,b, 2,3,7,f],
[a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d],
[c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5],
[9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8],
[5,8,e,g, 7,9,1,6, 3,4,d,a, b,c,f,2],
[7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a],
[d,c,2,b, a,8,8,f, 7,1,g,e, 3,6,4,9],
[f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g],
[6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3],
[g,b,4,7, 8,d,a,e, c,6,3,5, f,9,2,1],
[3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c]
]
GRID_BIG_INVALID_COLUMN = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b],
[2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e],
[e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7],
[b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4],
[8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6],
[1,e,6,d, c,8,4,5, a,g,9,b, 2,3,7,f],
[a,f,5,3, 2,1,6,n, 4,c,e,8, 9,b,7,d],
[c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5],
[9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8],
[5,8,e,g, 7,9,1,6, 3,4,d,a, b,c,f,2],
[7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a],
[d,c,2,b, a,5,8,f, 7,1,g,e, 3,6,4,9],
[f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g],
[6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3],
[g,b,4,7, 8,d,a,e, c,6,3,5, f,9,2,1],
[3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c]
]
GRID_BIG_INVALID_SUBGRID = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b],
[2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e],
[e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7],
[b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4],
[8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6],
[1,e,6,d, c,8,4,5, a,g,9,b, 2,3,7,f],
[a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d],
[c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5],
[9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8],
[5,8,e,g, 7,9,1,6, 3,n,d,a, b,c,f,2],
[7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a],
[d,c,2,b, a,5,8,f, 7,1,g,e, 3,6,4,9],
[f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g],
[6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3],
[g,b,n,7, 8,d,a,e, c,4,3,5, f,9,2,1],
[3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c]
]
GRIDS_BIG = [GRID_BIG_VALID, GRID_BIG_VALID_NONE, GRID_BIG_VALID_NONE_2, \
GRID_BIG_INVALID_ROW, GRID_BIG_INVALID_COLUMN, GRID_BIG_INVALID_SUBGRID]
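
The immediate cause of the traceback is evaluation order: grid[row][col] < 1 runs before the type check on the same line, so a None cell reaches the < comparison and Python 3 refuses to order NoneType against int. Here is a sketch of a corrected check_sudoku_grid, assuming (per the docstring) that None marks an allowed empty cell and that cells hold ints, as in the 9x9 GRIDS that main() tests:

def check_sudoku_grid(grid):
    n = len(grid)
    sub = int(round(n ** 0.5))  # subgrid side: 3 for a 9x9 grid, 2 for 4x4

    def no_duplicates(values):
        # duplicates are only counted among filled (non-None) cells
        filled = [v for v in values if v is not None]
        return len(filled) == len(set(filled))

    # validate cells first: None is allowed, anything else must be an int in 1..n
    for row in grid:
        for val in row:
            if val is not None and (not isinstance(val, int) or not 1 <= val <= n):
                return False
    # rows and columns
    if not all(no_duplicates(row) for row in grid):
        return False
    if not all(no_duplicates(col) for col in zip(*grid)):
        return False
    # subgrids
    for r in range(0, n, sub):
        for c in range(0, n, sub):
            if not no_duplicates(grid[r + i][c + j]
                                 for i in range(sub) for j in range(sub)):
                return False
    return True

Note that the original row/column check would also fail on grids containing None, since Python 3 cannot sort a mix of None and int; comparing counts of the filled cells avoids both problems.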

Convert a CSV file to an XML file with Python

I want to convert a CSV file to an XML file with Python. I want to group rows with the same id in the CSV file together and produce the desired XML (see desired output below). It's a bit more complex than it looks, with the indentation, looping, and grouping from CSV to XML. All help is appreciated.
My CSV file:
id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5
my code:
import itertools
import csv
import os
csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
xmlFile = r'C:\Users\Desktop\test XML\myData.xml'
csvData = csv.reader(open(csvFile))
xmlData = open(xmlFile, 'w')
xmlData.write('<?xml version="1.0" encoding="UTF-8"?>' + "\n" +'<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">' + "\n" )
xmlData.write(' '+'<Roughness-Profile>' + "\n")
rowNum = 0
for row in csvData:
    if rowNum == 0:
        tags = row
        # replace spaces w/ underscores in tag names
        for i in range(len(tags)):
            tags[i] = tags[i].replace(' ', '_')
    else:
        xmlData.write(' ' + '<surfaces>' + "\n" + ' ' + '<surface>' + "\n")
        for i in range(len(tags)):
            xmlData.write(' ' + '<' + tags[i] + '>' \
                + row[i] + '</' + tags[i] + '>' + "\n")
        xmlData.write(' ' + '</surface>' + "\n" + ' ' + '</surfaces>' + "\n" + ' ' + '</Roughness-Profile>' + "\n")
    rowNum += 1
xmlData.write('</Roughness-Profiles>' + "\n")
xmlData.close()
my xml output:
<?xml version="1.0" encoding="UTF-8"?>
<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">
<Roughness-Profile>
<surfaces>
<surface>
<id>a1</id>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>b2</id>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>c1</id>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>a1</id>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>b2</id>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>b2</id>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>c2</id>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>c1</id>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</Roughness-Profile>
</Roughness-Profiles>
Desired output should be:
<?xml version="1.0" encoding="UTF-8"?>
<R-Profiles xmlns="http://WKI/R-Profiles/1">
<R-Profile>
<id>a1</id>
<surfaces>
<surface>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
<surface>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>b2</id>
<surfaces>
<surface>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
<surface>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
<surface>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c1</id>
<surfaces>
<surface>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
<surface>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c2</id>
<surfaces>
<surface>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</R-Profile>
</R-Profiles>
I would do something very similar to what @Parfait suggested; use csv.DictReader and lxml to create the XML.
However, something is missing from that answer; the surface elements aren't grouped by id.
If I need to group XML during a transformation, the first thing I think of is XSLT.
Once you get the hang of it, grouping is easy with XSLT; especially 2.0 or greater. Unfortunately lxml only supports XSLT 1.0. In 1.0 you need to use Muenchian Grouping.
Here's a full example of creating an intermediate XML and transforming it with XSLT.
CSV Input (test.csv)
id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5
XSLT 1.0 (test.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:rp="http://WKI/Roughness-Profiles/1">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="surface" match="rp:surface" use="rp:id"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:for-each select="rp:surface[count(.|key('surface',rp:id)[1])=1]">
        <xsl:element name="Roughness-Profile" namespace="http://WKI/Roughness-Profiles/1">
          <xsl:copy-of select="rp:id"/>
          <xsl:element name="surfaces" namespace="http://WKI/Roughness-Profiles/1">
            <xsl:apply-templates select="key('surface',rp:id)"/>
          </xsl:element>
        </xsl:element>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="rp:id"/>
</xsl:stylesheet>
Python
import csv
import lxml.etree as etree

# INITIALIZING XML FILE WITH ROOT IN PROPER NAMESPACE
nsmap = {None: "http://WKI/Roughness-Profiles/1"}
root = etree.Element('Roughness-Profiles', nsmap=nsmap)

# READING CSV FILE
with open("test.csv") as f:
    reader = csv.DictReader(f)
    # WRITE INITIAL XML NODES
    for row in reader:
        surface_elem = etree.SubElement(root, "surface", nsmap=nsmap)
        for elem_name, elem_value in row.items():
            etree.SubElement(surface_elem, elem_name.strip(), nsmap=nsmap).text = str(elem_value)

# PARSE XSLT AND CREATE TRANSFORMER
xslt_root = etree.parse("test.xsl")
transform = etree.XSLT(xslt_root)

# TRANSFORM
# (Note the weird use of tostring/fromstring. This was used so
# namespaces in the XSLT would work the way they're supposed to.)
final_xml = transform(etree.fromstring(etree.tostring(root)))

# WRITE OUTPUT TO FILE
final_xml.write_output("test.xml")
XML Output (test.xml)
<?xml version="1.0"?>
<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">
<Roughness-Profile>
<id>a1</id>
<surfaces>
<surface>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
<surface>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</Roughness-Profile>
<Roughness-Profile>
<id>b2</id>
<surfaces>
<surface>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
<surface>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
<surface>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<Roughness-Profile>
<id>c1</id>
<surfaces>
<surface>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
<surface>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</Roughness-Profile>
<Roughness-Profile>
<id>c2</id>
<surfaces>
<surface>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</Roughness-Profile>
</Roughness-Profiles>
First read all rows from CSV and sort them.
Later you can use a variable previous_id to open and close Roughness-Profile/surfaces only when the id in a new row is different than in the previous one.
I used StringIO to simulate the csv file and sys.stdout to simulate the xml file, so everybody can copy the code and run it to see how it works.
text ='''id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5'''
from io import StringIO
import csv
import sys
#csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
#xmlFile = r'C:\Users\Desktop\test XML\myData.xml'
#csvData = csv.reader(open(csvFile))
#xmlData = open(xmlFile, 'w')
csvData = csv.reader(StringIO(text))
xmlData = sys.stdout
# read all data to sort them
csvData = list(csvData)
tags = [item.replace(' ', '_') for item in csvData[0]] # headers
csvData = sorted(csvData[1:]) # sort data without headers
xmlData.write('<?xml version="1.0" encoding="UTF-8"?>\n<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">\n')
previous_id = None
for row in csvData:
    row_id = row[0]
    if row_id != previous_id:
        # close previous group - but only if it is not the first group
        if previous_id is not None:
            xmlData.write('</surfaces>\n</Roughness-Profile>\n')
        # open new group
        xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
        # remember new group's id
        previous_id = row_id
    # surface
    xmlData.write('<surface>\n')
    for value, tag in zip(row[1:], tags[1:]):
        xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
    xmlData.write('</surface>\n')
# close last group
xmlData.write('</surfaces>\n</Roughness-Profile>\n')
xmlData.write('</Roughness-Profiles>\n')
#xmlData.close()
Version without StringIO and sys.stdout
import csv
csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
xmlFile = r'C:\Users\Desktop\test XML\myData.xml'
csvData = csv.reader(open(csvFile))
xmlData = open(xmlFile, 'w')
# read all data to sort them
csvData = list(csvData)
tags = [item.replace(' ', '_') for item in csvData[0]] # headers
csvData = sorted(csvData[1:]) # sort data without headers
xmlData.write('<?xml version="1.0" encoding="UTF-8"?>\n<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">\n')
previous_id = None
for row in csvData:
    row_id = row[0]
    if row_id != previous_id:
        # close previous group - but only if it is not the first group
        if previous_id is not None:
            xmlData.write('</surfaces>\n</Roughness-Profile>\n')
        # open new group
        xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
        # remember new group's id
        previous_id = row_id
    # surface
    xmlData.write('<surface>\n')
    for value, tag in zip(row[1:], tags[1:]):
        xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
    xmlData.write('</surface>\n')
# close last group
xmlData.write('</surfaces>\n</Roughness-Profile>\n')
xmlData.write('</Roughness-Profiles>\n')
xmlData.close()
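
As a variation, the previous_id bookkeeping can be replaced with itertools.groupby, which yields one group per run of consecutive rows sharing a key (this is exactly why the rows are sorted first). A sketch using the same tags and sorted csvData as above:

import itertools

for row_id, rows in itertools.groupby(csvData, key=lambda r: r[0]):
    xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
    for row in rows:
        xmlData.write('<surface>\n')
        for value, tag in zip(row[1:], tags[1:]):
            xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
        xmlData.write('</surface>\n')
    xmlData.write('</surfaces>\n</Roughness-Profile>\n')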
Because XML files are not plain text files but special text-based documents of structured data adhering to W3C specifications, avoid building the document by string concatenation.
Instead, use the appropriate DOM libraries available in virtually all modern programming languages, including Python with its built-in xml.etree or the more robust third-party module lxml. In fact, because your desired output involves grouping nodes by id, consider running XSLT, the special-purpose language designed to transform XML files; lxml can run XSLT 1.0 scripts.
Below, the DictReader of the built-in csv module is used to build a nested id dictionary (all columns grouped under id keys). Then the XML is built by iterating through the content of this dictionary, writing data to element nodes.
import csv
from collections import OrderedDict
import lxml.etree as ET

# BUILD NESTED ID DICTIONARY FROM CSV
with open("Input.csv") as f:
    reader = csv.DictReader(f)
    id_dct = OrderedDict({})
    for dct in reader:
        if dct["id"] not in id_dct.keys():
            id_dct[dct["id"]] = [OrderedDict({k: v for k, v in dct.items() if k != "id"})]
        else:
            id_dct[dct["id"]].append(OrderedDict({k: v for k, v in dct.items() if k != "id"}))

# INITIALIZING XML FILE WITH ROOT AND NAMESPACE
root = ET.Element('R-Profiles', nsmap={None: "http://WKI/Roughness-Profiles/1"})

# WRITING TO XML NODES
for k, v in id_dct.items():
    rpNode = ET.SubElement(root, "R-Profile")
    ET.SubElement(rpNode, "id").text = str(k)
    surfacesNode = ET.SubElement(rpNode, "surfaces")
    for dct in v:
        surfaceNode = ET.SubElement(surfacesNode, "surface")
        for k, v in dct.items():
            ET.SubElement(surfaceNode, k).text = str(v)

# OUTPUT XML CONTENT TO FILE
tree_out = ET.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8")
with open('Output.xml', 'wb') as f:
    f.write(tree_out)
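
As a side note, the if/else used to build the dictionary can be condensed with setdefault, which inserts an empty list the first time an id is seen; a sketch of the same loop body:

for dct in reader:
    row = OrderedDict((k, v) for k, v in dct.items() if k != "id")
    id_dct.setdefault(dct["id"], []).append(row)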
Input.csv
id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5
Output.xml
<?xml version='1.0' encoding='UTF-8'?>
<R-Profiles xmlns="http://WKI/Roughness-Profiles/1">
<R-Profile>
<id>a1</id>
<surfaces>
<surface>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
<surface>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>b2</id>
<surfaces>
<surface>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
<surface>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
<surface>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c1</id>
<surfaces>
<surface>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
<surface>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c2</id>
<surfaces>
<surface>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</R-Profile>
</R-Profiles>

Python / Get unique tokens from a file with an exception

I want to find the number of unique tokens in a file. For this purpose I wrote the below code:
splittedWords = open('output.txt', encoding='windows-1252').read().lower().split()
uniqueValues = set(splittedWords)
print(uniqueValues)
The output.txt file is like this:
Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl
club+Noun toplanti+Noun+A3pl+P3sg
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj
With this code I can get unique tokens like Türkiye+Noun and Türkiye+Noun+Gen. But I want, for example, Türkiye+Noun and Türkiye+Noun+Gen to be reduced to the part before the first + sign: I only want the Türkiye part. In the end, Türkiye+Noun and Türkiye+Noun+Gen need to be treated as a single unique token. I think I need to write a regex for this purpose.
It seems the word you want is always the first in a list of '+'-joined parts:
Split each of the splittedWords at '+' and take the 0th piece:
text = """Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl
club+Noun toplanti+Noun+A3pl+P3sg
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj """
splittedWords = text.lower().replace("\n"," ").split()
uniqueValues = set( ( s.split("+")[0] for s in splittedWords))
print(uniqueValues)
Output:
{'imha', 'çaba', 'ülke', 'arzula', 'terörizm', 'olus', 'daha', 'istikrar', 'küresel',
'sagla', 'önle', 'üzere', 'nisbi', 'türkiye', 'gelis', 'bir', 'karar', 'hedef', '2',
've', 'silah', 'kur', 'alan', 'club', 'boyut', '-', 'anlasma', 'iliski',
'izafi', 'kurumsal', 'karsi', 'ankara', 'ortaklik', 'obur', 'kitle', 'güven',
'uygula', 'ol', 'düzey', 'konsey', 'teknik', 'rejim', 'komite', 'gümrük', 'samimi',
'gel', 'yay', 'toplanti', '.', 'asama', 'mahiyet', 'ab', '69', 'için',
'paylas', '6', '/', 'nispi', 'dünya', 'at', 'sayili', 'görece', 'isbirlik', 'birlik',
',', 'tüm', 'ile', 'düzen', 'uyar', 'göster', 'tehdit', 'madde'}
You might need to do some additional cleanup to remove things like ',', '6', and '/':
split, then remove anything that's just numbers or punctuation.
from string import digits, punctuation

remove = set(digits + punctuation)
splittedWords = text.lower().split()
uniqueValues = set(s.split("+")[0] for s in splittedWords)
# remove from the set anything that consists only of digits or punctuation
uniqueValues = uniqueValues - set(x for x in uniqueValues if all(c in remove for c in x))
print(uniqueValues)
to get it as:
{'teknik', 'yay', 'göster','hedef', 'terörizm', 'ortaklik','ile', 'daha', 'ol', 'istikrar',
'paylas', 'nispi', 'üzere', 'sagla', 'tüm', 'önle', 'asama', 'uygula', 'güven', 'kur',
'türkiye', 'gel', 'dünya', 'gelis', 'sayili', 'ab', 'club', 'küresel', 'imha', 'çaba',
'olus', 'iliski', 'izafi', 'mahiyet', 've', 'düzey', 'anlasma', 'tehdit', 'bir', 'düzen',
'obur', 'samimi', 'boyut', 'ülke', 'arzula', 'rejim', 'gümrük', 'karar', 'at', 'karsi',
'nisbi', 'isbirlik', 'alan', 'toplanti', 'ankara', 'birlik', 'kurumsal', 'için', 'kitle',
'komite', 'silah', 'görece', 'uyar', 'madde', 'konsey'}
You can split all the tokens you have now on "+" and take only the first one.
uniqueValues = set(map(lambda x: x.split('+')[0], splittedWords))
Here I use map, which applies the function (the lambda part) to every element of splittedWords.
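
Since the question mentions a regex: the same result can be had with re.findall, capturing the run of non-'+' characters at the start of each whitespace-separated token. A sketch over the same text variable used in the first answer:

import re

# capture everything between a token boundary and the first '+'
tokens = re.findall(r'(?:^|\s)([^+\s]+)\+', text.lower())
uniqueValues = set(tokens)
print(uniqueValues)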

Python unicode decode not working for outlook exported csv

Hi, I've exported an Outlook contacts CSV file and loaded it into a Python shell.
I have a number of European names in the list and the following for example
tmp = 'Fern\xc3\x9fndez'
tmp.encode("latin-1")
results in an error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
while
tmp.decode('latin-1')
gives me
u'Fern\xc3\x9fndez'
How do I get the text to read as Fernandez? (not too worried about the accents - but happy to have them)
You must be using Python 2.x. (In Python 2, calling .encode() on a byte string first decodes it to unicode with the default ASCII codec, which is why an encode call can raise a UnicodeDecodeError.) Here is one way to print out the character (depending on which encoding you are working with):
>>> tmp = 'Fern\xc3\x9fndez'
>>> print tmp.decode('utf-8') # print formats the string for stdout
Fernßndez
>>> print tmp.decode('latin1')
FernÃndez
Are you sure you have the right character? Is it utf-8? And another way:
>>> print unicode(tmp, 'latin1')
FernÃndez
>>> print unicode(tmp, 'utf-8')
Fernßndez
Interesting. So none of these options worked for you? Incidentally, I ran the string through a few other encodings just to see if any of them had a character more in line with what I would expect. Unfortunately, I don't see any that look quite right:
>>> for encoding in ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8']:
...     try:
...         print encoding + ': ' + tmp.decode(encoding)
...     except:
...         pass
cp037: ãÁÊ>C¤>ÀÁ:
cp437: Fernßndez
cp500: ãÁÊ>C¤>ÀÁ:
cp737: Fern├θndez
cp775: Fern├¤ndez
cp850: Fernßndez
cp852: Fern├čndez
cp855: Fern├Ъndez
cp857: Fern├şndez
cp860: Fern├Óndez
cp861: Fernßndez
cp862: Fernßndez
cp863: Fernßndez
cp865: Fernßndez
cp866: Fern├Яndez
cp869: Fern├ίndez
cp875: ΖΧΈ>Cμ>ΦΧ:
cp932: Fernテ殤dez
cp949: Fern횩ndez
cp1006: Fernﺣndez
cp1026: ãÁÊ>C¤>ÀÁ:
cp1140: ãÁÊ>C€>ÀÁ:
cp1250: FernĂźndez
cp1251: FernГџndez
cp1252: Fernßndez
cp1254: Fernßndez
cp1256: Fernأںndez
cp1258: FernĂŸndez
gbk: Fern脽ndez
gb18030: Fern脽ndez
latin_1: FernÃndez
iso8859_2: FernĂndez
iso8859_4: FernÃndez
iso8859_5: FernУndez
iso8859_6: Fernأndez
iso8859_7: FernΓndez
iso8859_9: FernÃndez
iso8859_10: FernÃndez
iso8859_13: FernĆndez
iso8859_14: FernÃndez
iso8859_15: FernÃndez
koi8_r: Fernц÷ndez
koi8_u: Fernц÷ndez
mac_cyrillic: Fern√Яndez
mac_greek: FernΟündez
mac_iceland: Fernßndez
mac_latin2: Fernßndez
mac_roman: Fernßndez
mac_turkish: Fernßndez
ptcp154: FernГҹndez
shift_jis: Fernテ殤dez
shift_jis_2004: Fernテ殤dez
shift_jisx0213: Fernテ殤dez
utf_16: 敆湲鿃摮穥
utf_16_be: 䙥牮쎟湤敺
utf_16_le: 敆湲鿃摮穥
utf_8: Fernßndez
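
If the goal is plain ASCII close to Fernandez, a common trick (a sketch, still Python 2: decode to unicode first, then drop what will not decompose) is NFKD normalization. Note that ß has no ASCII decomposition and is simply dropped, while accented vowels such as á reduce to a:

import unicodedata

u = tmp.decode('utf-8')  # or 'latin-1', whichever encoding the file really uses
stripped = unicodedata.normalize('NFKD', u).encode('ascii', 'ignore')
print stripped           # 'Fernndez' for this string, since ß is dropped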

matching data packets and ICMP packets in case of TCP duplicates

I'm trying to match data packets with the ICMP time-exceeded packets they triggered. Therefore, I'm comparing 28-byte-long strings of each data packet (IP header + 8B of payload) with all (28-byte-long) ICMP payloads.
I'm having problems when I'm sending duplicate TCP packets:
>>> p1
<IP version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCP sport=10743 dport=37901 seq=2939035442L ack=2703569003L dataofs=10L reserved=0L flags=SA window=14480 chksum=0x9529 urgptr=0 options=[('MSS', 1460), ('SAckOK', ''), ('Timestamp', (215365485, 52950)), ('NOP', None), ('WScale', 4)] |>>
>>> p2
<IP version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCP sport=10743 dport=37901 seq=2939035442L ack=2703569003L dataofs=10L reserved=0L flags=SA window=14480 chksum=0x9426 urgptr=0 options=[('MSS', 1460), ('SAckOK', ''), ('Timestamp', (215365744, 52950)), ('NOP', None), ('WScale', 4)] |>>
...whose first 28 bytes are the same but which differ in the rest of the TCP header:
'E\x00\x00<\x00\x00#\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'
'E\x00\x00<\x00\x00#\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'
The ICMP packets I got have thus the same payload:
>>> i1[ICMP]
<ICMP type=time-exceeded code=ttl-zero-during-transit chksum=0x689a unused=0 |<IPerror version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCPerror sport=10743 dport=37901 seq=2939035442L |>>>
>>> i2[ICMP]
<ICMP type=time-exceeded code=ttl-zero-during-transit chksum=0x689a unused=0 |<IPerror version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCPerror sport=10743 dport=37901 seq=2939035442L |>>>
Corresponding strings are:
'E\x00\x00<\x00\x00#\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'
'E\x00\x00<\x00\x00#\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'
Right now in this particular case I'm claiming that p1 matches i1 because between i1 and i2, it is i1 that arrived soon after the sending of p1, whereas i2 arrived much later.
Is this enough? What else am I missing?
The header size of an IP packet is not always 20 bytes. If options are set, the header can be larger. You can use the Internet Header Length (IHL) field to find the actual header size and add the amount of payload you want (8 bytes here) to that number.
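
For completeness, a small sketch (assuming the scapy packets from the question, Python 2) that reads the IHL field instead of hard-coding a 20-byte header when slicing the bytes an ICMP message should quote:

raw = str(p1)                  # scapy packet as a raw byte string
ihl = ord(raw[0]) & 0x0f       # low nibble of byte 0 = header length in 32-bit words
header_len = ihl * 4           # 20 bytes unless IP options are present
quoted = raw[:header_len + 8]  # IP header + first 8 bytes of payload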
See also: Scapy: how do I get the full IP packet header?
