Invalid character "\u64e" in token Pylance - python
What does the Pylance error Invalid character "\u64e" in token mean? It puts a red error line under Acc in the code below. How can I fix it?
err = calculateCError()
print('Error is:', err, '%')
َAcc = 100 - err
print('َAccuracy is:',َAcc , '%')
Here's how to debug something like this:
s = """err = calculateCError()
print('Error is:', err, '%')
َAcc = 100 - err
print('َAccuracy is:',َAcc , '%')"""
print([hex(ord(c)) for c in s])
['0x65', '0x72', '0x72', '0x20', '0x3d', '0x20', '0x63', '0x61', '0x6c',
'0x63', '0x75', '0x6c', '0x61', '0x74', '0x65', '0x43', '0x45', '0x72',
'0x72', '0x6f', '0x72', '0x28', '0x29', '0xa', '0x70', '0x72', '0x69',
'0x6e', '0x74', '0x28', '0x27', '0x45', '0x72', '0x72', '0x6f', '0x72',
'0x20', '0x69', '0x73', '0x3a', '0x27', '0x2c', '0x20', '0x65', '0x72',
'0x72', '0x2c', '0x20', '0x27', '0x25', '0x27', '0x29', '0xa', '0x64e',
'0x41', '0x63', '0x63', '0x20', '0x3d', '0x20', '0x31', '0x30', '0x30',
'0x20', '0x2d', '0x20', '0x65', '0x72', '0x72', '0xa', '0x70', '0x72',
'0x69', '0x6e', '0x74', '0x28', '0x27', '0x64e', '0x41', '0x63', '0x63',
'0x75', '0x72', '0x61', '0x63', '0x79', '0x20', '0x69', '0x73', '0x3a',
'0x27', '0x2c', '0x64e', '0x41', '0x63', '0x63', '0x20', '0x2c', '0x20',
'0x27', '0x25', '0x27', '0x29']
And sure enough, there are three instances of 0x64e, each appearing just before an 0x41 (A). In fact, if you look carefully at your A characters, you will notice a faint slanted accent line above each one. That character is U+064E, ARABIC FATHA, a combining mark that is nearly invisible in most editors. Because it sits at the very start of the Acc token, the token is not a valid Python identifier, which is exactly what Pylance is reporting.
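You can confirm the diagnosis by asking the standard-library unicodedata module for the character names:

import unicodedata

for c in s:
    if ord(c) > 127:
        print(hex(ord(c)), unicodedata.name(c, 'UNKNOWN'))  # -> 0x64e ARABIC FATHA

To fix the error, delete and retype the Acc lines, or strip the stray mark programmatically. A minimal sketch; 'script.py' is a placeholder for your actual file:

with open('script.py', encoding='utf-8') as f:
    src = f.read()
with open('script.py', 'w', encoding='utf-8') as f:
    f.write(src.replace('\u064e', ''))  # drop every U+064E (ARABIC FATHA)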
Related
Sudoku checker issues with Python
I'm trying to create a sudoku checker in Python. I found a version here in another thread, but it does not work properly. I wonder what the issue is? I receive the following error:

Traceback (most recent call last):
  File "C:\Users\Omistaja\Downloads\sudoku_checker_template.py", line 72, in <module>
    main()
  File "C:\Users\Omistaja\Downloads\sudoku_checker_template.py", line 63, in main
    is_valid = check_sudoku_grid(grid)
  File "C:\Users\Omistaja\Downloads\sudoku_checker_template.py", line 20, in check_sudoku_grid
    if grid[row][col] < 1 or type(grid[row][col]) is not type(1):
TypeError: '<' not supported between instances of 'NoneType' and 'int'

Anyway, below is the whole thing. Only check_sudoku_grid should be modified; the rest should work. Thanks for your help!

from grids import GRID_NAMES, GRID_RETURNS, GRIDS, GRIDS_BIG, GRIDS_SMALL

GRID_SIZE = 9     # Length of one side of the sudoku
SUBGRID_SIZE = 3  # Length of one side of a cell of the sudoku

def check_sudoku_grid(grid):
    """
    Parameter : GRID_SIZE * GRID_SIZE two-dimensional list
    Return value : Boolean (True/False)

    Checks whether a sudoku grid is valid ie. doesn't contain any
    duplicates (besides None) in any row, column or cell.
    """
    for row in range(len(grid)):
        for col in range(len(grid)):
            # check value is an int
            if grid[row][col] < 1 or type(grid[row][col]) is not type(1):
                return False
            # check value is within 1 through n.
            # for example a 2x2 grid should not have the value 8 in it
            elif grid[row][col] > len(grid):
                return False
    # check the rows
    for row in grid:
        if sorted(list(set(row))) != sorted(row):
            return False
    # check the cols
    cols = []
    for col in range(len(grid)):
        for row in grid:
            cols += [row[col]]
        # set will get unique values, its converted to list so you can compare
        # it's sorted so the comparison is done correctly.
        if sorted(list(set(cols))) != sorted(cols):
            return False
        cols = []
    # if you get past all the false checks return True
    return True

def print_grid(grid):
    for i in range(GRID_SIZE):
        row = ""
        for j in range(GRID_SIZE):
            try:
                val = int(grid[i][j])
            except TypeError:
                val = "_"
            except ValueError:
                val = grid[i][j]
            row += "{} ".format(val)
            if j % SUBGRID_SIZE == SUBGRID_SIZE - 1:
                row += " "
        print(row)
        if i % SUBGRID_SIZE == SUBGRID_SIZE - 1:
            print()

def main():
    i = 0
    for grid in GRIDS:
        is_valid = check_sudoku_grid(grid)
        print("This grid {:s}.".format(GRID_NAMES[i]))
        print("Your function should return: {:s}".format(GRID_RETURNS[i]))
        print("Your function returns: {}".format(is_valid))
        print_grid(grid)
        i += 1

main()
GRID_NAMES = ["is valid", "is valid containing None values", "is valid containing None values (2)", \ "has an invalid row", "has an invalid column", "has an invalid subgrid"] GRID_RETURNS = ["True","True","True","False","False","False"] n = None a = 'a' b = 'b' c = 'c' d = 'd' e = 'e' f = 'f' g = 'g' GRID_VALID = [[7,3,5, 6,1,4, 8,9,2], [8,4,2, 9,7,3, 5,6,1], [9,6,1, 2,8,5, 3,7,4], [2,8,6, 3,4,9, 1,5,7], [4,1,3, 8,5,7, 9,2,6], [5,7,9, 1,2,6, 4,3,8], [1,5,7, 4,9,2, 6,8,3], [6,9,4, 7,3,8, 2,1,5], [3,2,8, 5,6,1, 7,4,9] ] GRID_VALID_NONE = [[7,3,5, 6,1,4, 8,9,2], [8,4,2, 9,7,3, 5,6,1], [9,6,1, 2,8,5, 3,7,4], [2,n,n, 3,4,n, 1,5,7], [4,1,3, 8,5,7, 9,2,6], [5,7,9, 1,2,6, 4,3,8], [1,5,7, 4,9,n, 6,8,3], [6,9,4, 7,3,8, 2,1,5], [n,2,8, 5,6,1, 7,4,9] ] GRID_VALID_NONE_2 = [[7,3,5, 6,1,4, n,9,2], [8,4,2, 9,7,3, 5,6,1], [n,n,1, 2,8,5, 3,7,4], [2,n,n, 3,4,n, 1,5,7], [4,1,3, 8,5,7, 9,2,6], [5,n,9, 1,2,6, 4,3,8], [1,5,7, 4,9,n, n,8,3], [6,9,4, 7,3,8, 2,1,5], [n,2,8, 5,6,1, 7,4,n] ] GRID_INVALID_SUBGRID = [[7,3,5, 6,1,4, 8,9,2], [8,4,2, 9,7,3, 5,6,1], [9,6,1, 2,8,5, 3,7,4], [2,8,6, 3,4,9, 1,5,7], [4,1,3, n,5,7, 9,2,6], [5,7,9, 1,2,6, 4,3,8], [1,5,7, 4,9,2, 6,8,3], [6,9,4, 7,3,8, 2,1,5], [3,2,n, 8,6,1, 7,4,9] ] GRID_INVALID_ROW = [[7,3,5, 6,1,4, 8,9,2], [8,4,2, 9,7,3, 5,6,1], [9,6,1, 2,8,5, 3,7,4], [2,8,6, 3,4,9, 1,5,7], [4,1,3, 8,5,7, 9,2,6], [5,7,9, 1,2,6, 4,3,8], [1,5,7, 4,9,2, 6,8,n], [6,9,4, 7,3,8, 2,1,3], [3,2,8, 5,6,1, 7,4,9] ] GRID_INVALID_COLUMN = [[7,3,5, 6,1,4, 8,9,2], [8,4,2, 9,7,3, 5,6,1], [9,6,1, 2,8,5, 3,7,4], [2,8,6, 3,4,9, 1,5,7], [4,1,3, 8,5,7, 9,2,6], [5,7,9, 1,2,6, 4,3,8], [1,5,n, 4,9,2, 6,8,3], [6,9,4, 7,3,8, 2,1,5], [7,2,8, 5,6,1, n,4,9] ] GRIDS = [GRID_VALID, GRID_VALID_NONE, GRID_VALID_NONE_2, \ GRID_INVALID_ROW, GRID_INVALID_COLUMN, GRID_INVALID_SUBGRID] GRID_SMALL_VALID = [[1,2, 3,4], [3,4, 1,2], [2,3, 4,1], [4,1, 2,3]] GRID_SMALL_VALID_NONE = [[1,n, 3,4], [3,4, n,n], [2,n, 4,1], [4,1, n,3]] GRID_SMALL_VALID_NONE_2 = [[1,n, 3,4], [n,n, n,2], [2,n, 4,1], [4,n, 2,3]] GRID_SMALL_INVALID_ROW = [[1,2, 3,n], [2,3, 4,4], [3,4, 1,2], [4,1, 2,3]] GRID_SMALL_INVALID_COLUMN = [[1,2, 3,4], [2,3, 4,1], [3,4, n,1], [4,1, 2,3]] GRID_SMALL_INVALID_SUBGRID = [[1,2, 3,4], [2,3, 4,1], [3,4, 1,2], [4,1, 2,3]] GRIDS_SMALL = [GRID_SMALL_VALID, GRID_SMALL_VALID_NONE, GRID_SMALL_VALID_NONE_2, \ GRID_SMALL_INVALID_ROW, GRID_SMALL_INVALID_COLUMN, GRID_SMALL_INVALID_SUBGRID] GRID_BIG_VALID = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b], [2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e], [e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7], [b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4], [8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6], [1,e,6,d, c,8,4,5, a,g,9,b, 2,3,7,f], [a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d], [c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5], [9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8], [5,8,e,g, 7,9,1,6, 3,4,d,a, b,c,f,2], [7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a], [d,c,2,b, a,5,8,f, 7,1,g,e, 3,6,4,9], [f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g], [6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3], [g,b,4,7, 8,d,a,e, c,6,3,5, f,9,2,1], [3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c] ] GRID_BIG_VALID_NONE = [[4,a,9,n, 1,7,d,8, 6,e,n,c, g,5,3,n], [n,5,3,1, f,n,b,g, d,9,8,7, 6,a,c,e], [e,6,d,c, 3,a,5,2, g,b,n,4, n,f,n,7], [b,7,n,8, n,e,9,c, n,3,a,f, 1,2,d,4], [8,g,b,4, d,f,n,9, 2,5,7,n, c,1,a,6], [1,e,n,d, c,n,4,5, a,g,n,b, 2,3,7,f], [a,f,n,3, 2,1,n,7, n,n,e,8, 9,b,g,n], [c,2,7,9, b,3,g,a, f,d,6,1, 4,n,n,5], [9,4,1,a, e,n,3,d, b,f,c,6, 7,g,5,8], [5,n,e,g, 7,9,n,6, 3,4,d,a, b,n,f,2], [7,3,f,6, g,b,c,4, n,n,5,9, e,d,n,a], [n,n,n,b, a,5,8,f, 7,1,n,e, 3,6,4,9], [f,9,8,2, 4,c,7,3, n,n,b,d, 5,e,6,g], [6,n,c,5, 
9,n,f,1, e,n,4,2, a,7,n,3], [g,b,4,7, 8,d,a,e, c,6,n,5, f,9,2,n], [3,1,n,n, n,6,2,b, 9,7,f,g, d,4,8,c] ] GRID_BIG_VALID_NONE_2 = [[4,a,9,f, 1,7,d,n, 6,e,n,n, g,n,3,b], [2,5,3,1, f,4,b,g, d,n,8,7, 6,a,n,e], [e,6,d,c, 3,a,n,2, g,b,1,4, 8,f,9,7], [b,7,g,n, n,e,9,c, 5,3,a,n, n,2,d,4], [8,g,b,4, d,f,e,n, 2,5,7,3, c,1,a,6], [1,n,6,d, n,n,4,n, a,g,n,b, 2,3,7,f], [a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d], [c,2,7,n, b,3,g,a, f,d,6,1, 4,8,e,5], [9,4,1,a, e,2,n,n, b,f,c,n, 7,g,5,8], [5,n,e,g, 7,9,n,6, 3,4,d,a, b,c,f,2], [7,3,f,6, g,b,c,4, 8,n,n,n, e,d,1,a], [d,c,2,n, a,n,8,f, 7,1,g,n, 3,6,n,9], [f,n,8,2, 4,c,7,3, 1,a,b,d, n,e,6,n], [6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3], [g,b,n,7, 8,d,a,e, n,6,n,5, f,n,2,n], [n,1,a,e, n,6,2,b, 9,n,f,g, d,n,8,c] ] GRID_BIG_INVALID_ROW = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b], [2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e], [e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7], [b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4], [8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6], [1,e,6,d, c,n,4,5, a,g,9,b, 2,3,7,f], [a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d], [c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5], [9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8], [5,8,e,g, 7,9,1,6, 3,4,d,a, b,c,f,2], [7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a], [d,c,2,b, a,8,8,f, 7,1,g,e, 3,6,4,9], [f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g], [6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3], [g,b,4,7, 8,d,a,e, c,6,3,5, f,9,2,1], [3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c] ] GRID_BIG_INVALID_COLUMN = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b], [2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e], [e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7], [b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4], [8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6], [1,e,6,d, c,8,4,5, a,g,9,b, 2,3,7,f], [a,f,5,3, 2,1,6,n, 4,c,e,8, 9,b,7,d], [c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5], [9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8], [5,8,e,g, 7,9,1,6, 3,4,d,a, b,c,f,2], [7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a], [d,c,2,b, a,5,8,f, 7,1,g,e, 3,6,4,9], [f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g], [6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3], [g,b,4,7, 8,d,a,e, c,6,3,5, f,9,2,1], [3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c] ] GRID_BIG_INVALID_SUBGRID = [[4,a,9,f, 1,7,d,8, 6,e,2,c, g,5,3,b], [2,5,3,1, f,4,b,g, d,9,8,7, 6,a,c,e], [e,6,d,c, 3,a,5,2, g,b,1,4, 8,f,9,7], [b,7,g,8, 6,e,9,c, 5,3,a,f, 1,2,d,4], [8,g,b,4, d,f,e,9, 2,5,7,3, c,1,a,6], [1,e,6,d, c,8,4,5, a,g,9,b, 2,3,7,f], [a,f,5,3, 2,1,6,7, 4,c,e,8, 9,b,g,d], [c,2,7,9, b,3,g,a, f,d,6,1, 4,8,e,5], [9,4,1,a, e,2,3,d, b,f,c,6, 7,g,5,8], [5,8,e,g, 7,9,1,6, 3,n,d,a, b,c,f,2], [7,3,f,6, g,b,c,4, 8,2,5,9, e,d,1,a], [d,c,2,b, a,5,8,f, 7,1,g,e, 3,6,4,9], [f,9,8,2, 4,c,7,3, 1,a,b,d, 5,e,6,g], [6,d,c,5, 9,g,f,1, e,8,4,2, a,7,b,3], [g,b,n,7, 8,d,a,e, c,4,3,5, f,9,2,1], [3,1,a,e, 5,6,2,b, 9,7,f,g, d,4,8,c] ] GRIDS_BIG = [GRID_BIG_VALID, GRID_BIG_VALID_NONE, GRID_BIG_VALID_NONE_2, \ GRID_BIG_INVALID_ROW, GRID_BIG_INVALID_COLUMN, GRID_BIG_INVALID_SUBGRID]
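The TypeError above comes from a None cell reaching the < comparison before the type test. A minimal sketch of a reordered guard for the first loop of check_sudoku_grid, assuming the None blanks in the valid test grids are meant to be skipped:

for row in range(len(grid)):
    for col in range(len(grid)):
        value = grid[row][col]
        if value is None:
            continue                        # blanks are allowed in the test grids
        if not isinstance(value, int):      # test the type before comparing
            return False
        if value < 1 or value > len(grid):  # value must be within 1..n
            return False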
Convert a CSV file to an XML file with Python
I want to convert a csv file to an xml file with Python. I want to group the same id's in the csv file together and convert the csv into xml (see desired output). It is a bit more complex than it looks, with the indentation, looping and grouping of the csv into xml. All help is appreciated.

My CSV file:

id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5

my code:

import itertools
import csv
import os

csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
xmlFile = r'C:\Users\Desktop\test XML\myData.xml'

csvData = csv.reader(open(csvFile))
xmlData = open(xmlFile, 'w')

xmlData.write('<?xml version="1.0" encoding="UTF-8"?>' + "\n" +
              '<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">' + "\n")
xmlData.write(' ' + '<Roughness-Profile>' + "\n")

rowNum = 0
for row in csvData:
    if rowNum == 0:
        tags = row
        # replace spaces w/ underscores in tag names
        for i in range(len(tags)):
            tags[i] = tags[i].replace(' ', '_')
    else:
        xmlData.write('  ' + '<surfaces>' + "\n" + '   ' + '<surface>' + "\n")
        for i in range(len(tags)):
            xmlData.write('    ' + '<' + tags[i] + '>' \
                          + row[i] + '</' + tags[i] + '>' + "\n")
        xmlData.write('   ' + '</surface>' + "\n" + '  ' + '</surfaces>' + "\n" +
                      ' ' + '</Roughness-Profile>' + "\n")
    rowNum += 1

xmlData.write('</Roughness-Profiles>' + "\n")
xmlData.close()

my xml output:

<?xml version="1.0" encoding="UTF-8"?>
<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">
 <Roughness-Profile>
  <surfaces>
   <surface>
    <id>a1</id> <x1>1.3</x1> <y1>2.1</y1> <z1>3.6</z1> <x2>4.5</x2> <y2>5.1</y2> <z2>6.8</z2> <c1>B</c1> <R>7.3</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
  <surfaces>
   <surface>
    <id>b2</id> <x1>1.1</x1> <y1>2.1</y1> <z1>3.1</z1> <x2>4.1</x2> <y2>5.1</y2> <z2>6.1</z2> <c1>G</c1> <R>7.1</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
  <surfaces>
   <surface>
    <id>c1</id> <x1>2.1</x1> <y1>3.1</y1> <z1>4.1</z1> <x2>5.1</x2> <y2>2.1</y2> <z2>7.1</z2> <c1>G</c1> <R>8.1</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
  <surfaces>
   <surface>
    <id>a1</id> <x1>2.2</x1> <y1>3.2</y1> <z1>4.2</z1> <x2>5.2</x2> <y2>6.2</y2> <z2>7.2</z2> <c1>S</c1> <R>8.2</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
  <surfaces>
   <surface>
    <id>b2</id> <x1>4.1</x1> <y1>5.1</y1> <z1>2.1</z1> <x2>7.1</x2> <y2>8.1</y2> <z2>9.1</z2> <c1>S</c1> <R>2.5</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
  <surfaces>
   <surface>
    <id>b2</id> <x1>3.6</x1> <y1>4.5</y1> <z1>5.1</z1> <x2>6.3</x2> <y2>7.4</y2> <z2>8.2</z2> <c1>G</c1> <R>3.1</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
  <surfaces>
   <surface>
    <id>c2</id> <x1>6.1</x1> <y1>7.1</y1> <z1>8.1</z1> <x2>9.1</x2> <y2>2.1</y2> <z2>11.1</z2> <c1>S</c1> <R>3.2</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
  <surfaces>
   <surface>
    <id>c1</id> <x1>1.5</x1> <y1>1.5</y1> <z1>1.5</z1> <x2>1.5</x2> <y2>1.5</y2> <z2>1.5</z2> <c1>A</c1> <R>1.5</R>
   </surface>
  </surfaces>
 </Roughness-Profile>
</Roughness-Profiles>

Desired output should be:

<?xml version="1.0" encoding="UTF-8"?>
<R-Profiles xmlns="http://WKI/R-Profiles/1">
 <R-Profile>
  <id>a1</id>
  <surfaces>
   <surface>
    <x1>1.3</x1> <y1>2.1</y1> <z1>3.6</z1> <x2>4.5</x2> <y2>5.1</y2> <z2>6.8</z2> <c1>B</c1> <R>7.3</R>
   </surface>
   <surface>
    <x1>2.2</x1> <y1>3.2</y1> <z1>4.2</z1> <x2>5.2</x2> <y2>6.2</y2> <z2>7.2</z2> <c1>S</c1> <R>8.2</R>
   </surface>
  </surfaces>
 </R-Profile>
 <R-Profile>
  <id>b2</id>
  <surfaces>
   <surface>
    <x1>1.1</x1> <y1>2.1</y1> <z1>3.1</z1> <x2>4.1</x2> <y2>5.1</y2> <z2>6.1</z2> <c1>G</c1> <R>7.1</R>
   </surface>
   <surface>
    <x1>4.1</x1> <y1>5.1</y1> <z1>2.1</z1> <x2>7.1</x2> <y2>8.1</y2> <z2>9.1</z2> <c1>S</c1> <R>2.5</R>
   </surface>
   <surface>
    <x1>3.6</x1> <y1>4.5</y1> <z1>5.1</z1> <x2>6.3</x2> <y2>7.4</y2> <z2>8.2</z2> <c1>G</c1> <R>3.1</R>
   </surface>
  </surfaces>
 </R-Profile>
 <R-Profile>
  <id>c1</id>
  <surfaces>
   <surface>
    <x1>2.1</x1> <y1>3.1</y1> <z1>4.1</z1> <x2>5.1</x2> <y2>2.1</y2> <z2>7.1</z2> <c1>G</c1> <R>8.1</R>
   </surface>
   <surface>
    <x1>1.5</x1> <y1>1.5</y1> <z1>1.5</z1> <x2>1.5</x2> <y2>1.5</y2> <z2>1.5</z2> <c1>A</c1> <R>1.5</R>
   </surface>
  </surfaces>
 </R-Profile>
 <R-Profile>
  <id>c2</id>
  <surfaces>
   <surface>
    <x1>6.1</x1> <y1>7.1</y1> <z1>8.1</z1> <x2>9.1</x2> <y2>2.1</y2> <z2>11.1</z2> <c1>S</c1> <R>3.2</R>
   </surface>
  </surfaces>
 </R-Profile>
</R-Profiles>
I would do something very similar to what @Parfait suggested: use csv.DictReader and lxml to create the XML. However, something is missing from that answer; the surface elements aren't grouped by id.

If I need to group XML during a transformation, the first thing I think of is XSLT. Once you get the hang of it, grouping is easy with XSLT, especially 2.0 or greater. Unfortunately lxml only supports XSLT 1.0, and in 1.0 you need to use Muenchian Grouping. Here's a full example of creating an intermediate XML and transforming it with XSLT.

CSV Input (test.csv)

id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:rp="http://WKI/Roughness-Profiles/1">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="surface" match="rp:surface" use="rp:id"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:for-each select="rp:surface[count(.|key('surface',rp:id)[1])=1]">
        <xsl:element name="Roughness-Profile" namespace="http://WKI/Roughness-Profiles/1">
          <xsl:copy-of select="rp:id"/>
          <xsl:element name="surfaces" namespace="http://WKI/Roughness-Profiles/1">
            <xsl:apply-templates select="key('surface',rp:id)"/>
          </xsl:element>
        </xsl:element>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="rp:id"/>
</xsl:stylesheet>

Python

import csv
import lxml.etree as etree

# INITIALIZING XML FILE WITH ROOT IN PROPER NAMESPACE
nsmap = {None: "http://WKI/Roughness-Profiles/1"}
root = etree.Element('Roughness-Profiles', nsmap=nsmap)

# READING CSV FILE
with open("test.csv") as f:
    reader = csv.DictReader(f)
    # WRITE INITIAL XML NODES
    for row in reader:
        surface_elem = etree.SubElement(root, "surface", nsmap=nsmap)
        for elem_name, elem_value in row.items():
            etree.SubElement(surface_elem, elem_name.strip(), nsmap=nsmap).text = str(elem_value)

# PARSE XSLT AND CREATE TRANSFORMER
xslt_root = etree.parse("test.xsl")
transform = etree.XSLT(xslt_root)

# TRANSFORM
# (Note the weird use of tostring/fromstring. This was used so
# namespaces in the XSLT would work the way they're supposed to.)
final_xml = transform(etree.fromstring(etree.tostring(root)))

# WRITE OUTPUT TO FILE
final_xml.write_output("test.xml")

XML Output (test.xml)

<?xml version="1.0"?>
<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">
  <Roughness-Profile>
    <id>a1</id>
    <surfaces>
      <surface>
        <x1>1.3</x1> <y1>2.1</y1> <z1>3.6</z1> <x2>4.5</x2> <y2>5.1</y2> <z2>6.8</z2> <c1>B</c1> <R>7.3</R>
      </surface>
      <surface>
        <x1>2.2</x1> <y1>3.2</y1> <z1>4.2</z1> <x2>5.2</x2> <y2>6.2</y2> <z2>7.2</z2> <c1>S</c1> <R>8.2</R>
      </surface>
    </surfaces>
  </Roughness-Profile>
  <Roughness-Profile>
    <id>b2</id>
    <surfaces>
      <surface>
        <x1>1.1</x1> <y1>2.1</y1> <z1>3.1</z1> <x2>4.1</x2> <y2>5.1</y2> <z2>6.1</z2> <c1>G</c1> <R>7.1</R>
      </surface>
      <surface>
        <x1>4.1</x1> <y1>5.1</y1> <z1>2.1</z1> <x2>7.1</x2> <y2>8.1</y2> <z2>9.1</z2> <c1>S</c1> <R>2.5</R>
      </surface>
      <surface>
        <x1>3.6</x1> <y1>4.5</y1> <z1>5.1</z1> <x2>6.3</x2> <y2>7.4</y2> <z2>8.2</z2> <c1>G</c1> <R>3.1</R>
      </surface>
    </surfaces>
  </Roughness-Profile>
  <Roughness-Profile>
    <id>c1</id>
    <surfaces>
      <surface>
        <x1>2.1</x1> <y1>3.1</y1> <z1>4.1</z1> <x2>5.1</x2> <y2>2.1</y2> <z2>7.1</z2> <c1>G</c1> <R>8.1</R>
      </surface>
      <surface>
        <x1>1.5</x1> <y1>1.5</y1> <z1>1.5</z1> <x2>1.5</x2> <y2>1.5</y2> <z2>1.5</z2> <c1>A</c1> <R>1.5</R>
      </surface>
    </surfaces>
  </Roughness-Profile>
  <Roughness-Profile>
    <id>c2</id>
    <surfaces>
      <surface>
        <x1>6.1</x1> <y1>7.1</y1> <z1>8.1</z1> <x2>9.1</x2> <y2>2.1</y2> <z2>11.1</z2> <c1>S</c1> <R>3.2</R>
      </surface>
    </surfaces>
  </Roughness-Profile>
</Roughness-Profiles>
First read all rows from the CSV and sort them. Then you can use a variable previous_id to open and close Roughness-Profile/surfaces only when the id in the new row is different than in the previous one.

I used StringIO to simulate the csv file and sys.stdout to simulate the xml file, so everybody can copy the code and run it to see how it works.

text = '''id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5'''

from io import StringIO
import csv
import sys

#csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
#xmlFile = r'C:\Users\Desktop\test XML\myData.xml'
#csvData = csv.reader(open(csvFile))
#xmlData = open(xmlFile, 'w')

csvData = csv.reader(StringIO(text))
xmlData = sys.stdout

# read all data to sort them
csvData = list(csvData)
tags = [item.replace(' ', '_') for item in csvData[0]]  # headers
csvData = sorted(csvData[1:])  # sort data without headers

xmlData.write('<?xml version="1.0" encoding="UTF-8"?>\n<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">\n')

previous_id = None

for row in csvData:
    row_id = row[0]

    if row_id != previous_id:
        # close previous group - but only if it is not the first group
        if previous_id is not None:
            xmlData.write('</surfaces>\n</Roughness-Profile>\n')
        # open new group
        xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
        # remember new group's id
        previous_id = row_id

    # surface
    xmlData.write('<surface>\n')
    for value, tag in zip(row[1:], tags[1:]):
        xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
    xmlData.write('</surface>\n')

# close last group
xmlData.write('</surfaces>\n</Roughness-Profile>\n')
xmlData.write('</Roughness-Profiles>\n')

#xmlData.close()

Version without StringIO and sys.stdout:

import csv

csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
xmlFile = r'C:\Users\Desktop\test XML\myData.xml'

csvData = csv.reader(open(csvFile))
xmlData = open(xmlFile, 'w')

# read all data to sort them
csvData = list(csvData)
tags = [item.replace(' ', '_') for item in csvData[0]]  # headers
csvData = sorted(csvData[1:])  # sort data without headers

xmlData.write('<?xml version="1.0" encoding="UTF-8"?>\n<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">\n')

previous_id = None

for row in csvData:
    row_id = row[0]

    if row_id != previous_id:
        # close previous group - but only if it is not the first group
        if previous_id is not None:
            xmlData.write('</surfaces>\n</Roughness-Profile>\n')
        # open new group
        xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
        # remember new group's id
        previous_id = row_id

    # surface
    xmlData.write('<surface>\n')
    for value, tag in zip(row[1:], tags[1:]):
        xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
    xmlData.write('</surface>\n')

# close last group
xmlData.write('</surfaces>\n</Roughness-Profile>\n')
xmlData.write('</Roughness-Profiles>\n')

xmlData.close()
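Since the rows end up sorted by id anyway, the previous_id bookkeeping can also be replaced with itertools.groupby (which the original question already imports). A minimal sketch, reusing csvData, tags and xmlData from the code above:

from itertools import groupby
from operator import itemgetter

# csvData must already be sorted by the id column for groupby to work
for row_id, rows in groupby(csvData, key=itemgetter(0)):
    xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
    for row in rows:
        xmlData.write('<surface>\n')
        for value, tag in zip(row[1:], tags[1:]):
            xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
        xmlData.write('</surface>\n')
    xmlData.write('</surfaces>\n</Roughness-Profile>\n')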
Because XML files are not text files but special text-based documents of structured data adhering to W3C specifications, avoid building the document by string concatenation. Instead, use the appropriate DOM libraries, available in virtually all modern programming languages, including Python with its built-in xml.etree or the more robust third-party module lxml. In fact, because your desired output involves grouping nodes by id, consider running XSLT, the special-purpose language designed to transform XML files; the lxml module can run XSLT 1.0 scripts.

Below, the DictReader of the built-in csv module is used to build a nested id dictionary (all columns grouped under id keys). Then the XML is built by iterating through the content of this dictionary, writing data to element nodes.

import csv
from collections import OrderedDict
import lxml.etree as ET

# BUILD NESTED ID DICTIONARY FROM CSV
with open("Input.csv") as f:
    reader = csv.DictReader(f)
    id_dct = OrderedDict({})
    for dct in reader:
        if dct["id"] not in id_dct.keys():
            id_dct[dct["id"]] = [OrderedDict({k: v for k, v in dct.items() if k != "id"})]
        else:
            id_dct[dct["id"]].append(OrderedDict({k: v for k, v in dct.items() if k != "id"}))

# INITIALIZING XML FILE WITH ROOT AND NAMESPACE
root = ET.Element('R-Profiles', nsmap={None: "http://WKI/Roughness-Profiles/1"})

# WRITING TO XML NODES
for k, v in id_dct.items():
    rpNode = ET.SubElement(root, "R-Profile")
    ET.SubElement(rpNode, "id").text = str(k)
    surfacesNode = ET.SubElement(rpNode, "surfaces")
    for dct in v:
        surfaceNode = ET.SubElement(surfacesNode, "surface")
        for k, v in dct.items():
            ET.SubElement(surfaceNode, k).text = str(v)

# OUTPUT XML CONTENT TO FILE
tree_out = ET.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8")
with open('Output.xml', 'wb') as f:
    f.write(tree_out)

Input.csv

id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5

Output.xml

<?xml version='1.0' encoding='UTF-8'?>
<R-Profiles xmlns="http://WKI/Roughness-Profiles/1">
  <R-Profile>
    <id>a1</id>
    <surfaces>
      <surface>
        <x1>1.3</x1> <y1>2.1</y1> <z1>3.6</z1> <x2>4.5</x2> <y2>5.1</y2> <z2>6.8</z2> <c1>B</c1> <R>7.3</R>
      </surface>
      <surface>
        <x1>2.2</x1> <y1>3.2</y1> <z1>4.2</z1> <x2>5.2</x2> <y2>6.2</y2> <z2>7.2</z2> <c1>S</c1> <R>8.2</R>
      </surface>
    </surfaces>
  </R-Profile>
  <R-Profile>
    <id>b2</id>
    <surfaces>
      <surface>
        <x1>1.1</x1> <y1>2.1</y1> <z1>3.1</z1> <x2>4.1</x2> <y2>5.1</y2> <z2>6.1</z2> <c1>G</c1> <R>7.1</R>
      </surface>
      <surface>
        <x1>4.1</x1> <y1>5.1</y1> <z1>2.1</z1> <x2>7.1</x2> <y2>8.1</y2> <z2>9.1</z2> <c1>S</c1> <R>2.5</R>
      </surface>
      <surface>
        <x1>3.6</x1> <y1>4.5</y1> <z1>5.1</z1> <x2>6.3</x2> <y2>7.4</y2> <z2>8.2</z2> <c1>G</c1> <R>3.1</R>
      </surface>
    </surfaces>
  </R-Profile>
  <R-Profile>
    <id>c1</id>
    <surfaces>
      <surface>
        <x1>2.1</x1> <y1>3.1</y1> <z1>4.1</z1> <x2>5.1</x2> <y2>2.1</y2> <z2>7.1</z2> <c1>G</c1> <R>8.1</R>
      </surface>
      <surface>
        <x1>1.5</x1> <y1>1.5</y1> <z1>1.5</z1> <x2>1.5</x2> <y2>1.5</y2> <z2>1.5</z2> <c1>A</c1> <R>1.5</R>
      </surface>
    </surfaces>
  </R-Profile>
  <R-Profile>
    <id>c2</id>
    <surfaces>
      <surface>
        <x1>6.1</x1> <y1>7.1</y1> <z1>8.1</z1> <x2>9.1</x2> <y2>2.1</y2> <z2>11.1</z2> <c1>S</c1> <R>3.2</R>
      </surface>
    </surfaces>
  </R-Profile>
</R-Profiles>
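On Python 3.7+ plain dicts preserve insertion order, so the OrderedDict bookkeeping above can be trimmed with collections.defaultdict. A minimal sketch of just the grouping step, assuming the same Input.csv:

import csv
from collections import defaultdict

id_dct = defaultdict(list)
with open("Input.csv") as f:
    for row in csv.DictReader(f):
        row_id = row.pop("id")     # take the id out; keep the remaining columns
        id_dct[row_id].append(row)
# id_dct now maps each id to the list of its surface dicts, in file order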
Python / Get unique tokens from a file with an exception
I want to find the number of unique tokens in a file. For this purpose I wrote the code below:

splittedWords = open('output.txt', encoding='windows-1252').read().lower().split()
uniqueValues = set(splittedWords)
print(uniqueValues)

The output.txt file looks like this:

Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj
düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det
ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl
,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen
birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj
ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj
iliski+Noun+A3pl club+Noun toplanti+Noun+A3pl+P3sg Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg
komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun
rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom
ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun
mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc nispi+Adj nisbi+Adj görece+Adj+With izafi+Adj obur+Adj

With this code I get unique tokens like Türkiye+Noun and Türkiye+Noun+Gen. But I want to keep only the part before the first + sign, so that, for example, Türkiye+Noun and Türkiye+Noun+Gen both collapse to Türkiye and are treated as a single unique token. I think I need to write a regex for this purpose.
It seems the word you want is always the first in a list of '+'-joined words, so split each token at '+' and take the 0th part:

text = """Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj
düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det
ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl
,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen
birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj
ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj
iliski+Noun+A3pl club+Noun toplanti+Noun+A3pl+P3sg Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg
komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun
rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom
ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun
mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc nispi+Adj nisbi+Adj görece+Adj+With izafi+Adj obur+Adj
"""

splittedWords = text.lower().replace("\n", " ").split()
uniqueValues = set(s.split("+")[0] for s in splittedWords)
print(uniqueValues)

Output:

{'imha', 'çaba', 'ülke', 'arzula', 'terörizm', 'olus', 'daha', 'istikrar', 'küresel', 'sagla', 'önle', 'üzere',
'nisbi', 'türkiye', 'gelis', 'bir', 'karar', 'hedef', '2', 've', 'silah', 'kur', 'alan', 'club', 'boyut', '-',
'anlasma', 'iliski', 'izafi', 'kurumsal', 'karsi', 'ankara', 'ortaklik', 'obur', 'kitle', 'güven', 'uygula', 'ol',
'düzey', 'konsey', 'teknik', 'rejim', 'komite', 'gümrük', 'samimi', 'gel', 'yay', 'toplanti', '.', 'asama',
'mahiyet', 'ab', '69', 'için', 'paylas', '6', '/', 'nispi', 'dünya', 'at', 'sayili', 'görece', 'isbirlik',
'birlik', ',', 'tüm', 'ile', 'düzen', 'uyar', 'göster', 'tehdit', 'madde'}

You might need to do some additional cleanup to remove things like ',', '6' and '/'. Split, then remove anything that is just numbers or punctuation:

from string import digits, punctuation

remove = set(digits + punctuation)

splittedWords = text.lower().split()
uniqueValues = set(s.split("+")[0] for s in splittedWords)
# remove from the set anything that consists only of numbers or punctuation
uniqueValues = uniqueValues - set(x for x in uniqueValues if all(c in remove for c in x))
print(uniqueValues)

to get:

{'teknik', 'yay', 'göster', 'hedef', 'terörizm', 'ortaklik', 'ile', 'daha', 'ol', 'istikrar', 'paylas', 'nispi',
'üzere', 'sagla', 'tüm', 'önle', 'asama', 'uygula', 'güven', 'kur', 'türkiye', 'gel', 'dünya', 'gelis', 'sayili',
'ab', 'club', 'küresel', 'imha', 'çaba', 'olus', 'iliski', 'izafi', 'mahiyet', 've', 'düzey', 'anlasma', 'tehdit',
'bir', 'düzen', 'obur', 'samimi', 'boyut', 'ülke', 'arzula', 'rejim', 'gümrük', 'karar', 'at', 'karsi', 'nisbi',
'isbirlik', 'alan', 'toplanti', 'ankara', 'birlik', 'kurumsal', 'için', 'kitle', 'komite', 'silah', 'görece',
'uyar', 'madde', 'konsey'}
You can split all the tokens you have now on "+" and take only the first part:

uniqueValues = set(map(lambda x: x.split('+')[0], splittedWords))

Here I use map. map applies the function (the lambda part) to all values of splittedWords.
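If you do want the regex the question hints at, a minimal sketch (reading the same output.txt; the pattern keeps whatever precedes the first '+' of each whitespace-separated token):

import re

text = open('output.txt', encoding='windows-1252').read().lower()
# one match per token: the run of characters before the first '+'
uniqueValues = set(re.findall(r'(?:^|\s)([^+\s]+)', text))
print(uniqueValues)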
Python unicode decode not working for outlook exported csv
Hi, I've exported an Outlook contacts csv file and loaded it into a Python shell. I have a number of European names in the list; for example:

tmp = 'Fern\xc3\x9fndez'
tmp.encode("latin-1")

results in an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

while

tmp.decode('latin-1')

gives me

u'Fern\xc3\x9fndez'

How do I get the text to read as Fernandez? (Not too worried about the accents, but happy to have them.)
You must be using Python 2.x. Here is one way to print out the character (depending on which encoding you are working with):

>>> tmp = 'Fern\xc3\x9fndez'
>>> print tmp.decode('utf-8')   # print formats the string for stdout
Fernßndez
>>> print tmp.decode('latin1')
FernÃndez

Are you sure you have the right character? Is it utf-8? And another way:

>>> print unicode(tmp, 'latin1')
FernÃndez
>>> print unicode(tmp, 'utf-8')
Fernßndez

Interesting. So none of these options worked for you? Incidentally, I ran the string through a few other encodings just to see if any of them had a character more in line with what I would expect. Unfortunately, I don't see any that look quite right:

>>> for encoding in ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500',
...                  'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857',
...                  'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866',
...                  'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006',
...                  'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253',
...                  'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp',
...                  'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk',
...                  'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
...                  'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr',
...                  'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5',
...                  'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10',
...                  'iso8859_13', 'iso8859_14', 'iso8859_15', 'johab', 'koi8_r',
...                  'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2',
...                  'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis',
...                  'shift_jis_2004', 'shift_jisx0213', 'utf_16', 'utf_16_be',
...                  'utf_16_le', 'utf_7', 'utf_8']:
...     try:
...         print encoding + ': ' + tmp.decode(encoding)
...     except:
...         pass
...
cp037: ãÁÊ>C¤>ÀÁ:
cp437: Fern├ƒndez
cp500: ãÁÊ>C¤>ÀÁ:
cp737: Fern├θndez
cp775: Fern├¤ndez
cp850: Fern├ƒndez
cp852: Fern├čndez
cp855: Fern├Ъndez
cp857: Fern├şndez
cp860: Fern├Óndez
cp861: Fern├ƒndez
cp862: Fern├ƒndez
cp863: Fern├ƒndez
cp865: Fern├ƒndez
cp866: Fern├Яndez
cp869: Fern├ίndez
cp875: ΖΧΈ>Cμ>ΦΧ:
cp932: Fernテ殤dez
cp949: Fern횩ndez
cp1006: Fernﺣndez
cp1026: ãÁÊ>C¤>ÀÁ:
cp1140: ãÁÊ>C€>ÀÁ:
cp1250: FernĂźndez
cp1251: FernГџndez
cp1252: Fernßndez
cp1254: Fernßndez
cp1256: Fernأںndez
cp1258: FernĂŸndez
gbk: Fern脽ndez
gb18030: Fern脽ndez
latin_1: FernÃndez
iso8859_2: FernĂndez
iso8859_4: FernÃndez
iso8859_5: FernУndez
iso8859_6: Fernأndez
iso8859_7: FernΓndez
iso8859_9: FernÃndez
iso8859_10: FernÃndez
iso8859_13: FernĆndez
iso8859_14: FernÃndez
iso8859_15: FernÃndez
koi8_r: Fernц÷ndez
koi8_u: Fernц÷ndez
mac_cyrillic: Fern√Яndez
mac_greek: FernΟündez
mac_iceland: Fern√ündez
mac_latin2: Fern√ündez
mac_roman: Fern√ündez
mac_turkish: Fern√ündez
ptcp154: FernГҹndez
shift_jis: Fernテ殤dez
shift_jis_2004: Fernテ殤dez
shift_jisx0213: Fernテ殤dez
utf_16: 敆湲鿃摮穥
utf_16_be: 䙥牮쎟湤敺
utf_16_le: 敆湲鿃摮穥
utf_8: Fernßndez
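For what it's worth, '\xc3\x9f' is the UTF-8 encoding of ß, so the export is probably UTF-8 (and the name may have been mangled before it ever reached the csv). On Python 3 you would decode once, when opening the file, and since you mainly want a plain-ASCII rendering, unicodedata can strip the accent marks. A minimal sketch; the filename and the utf-8 encoding are assumptions, not facts from the question:

import unicodedata

def to_ascii(text):
    # Decompose accented characters (NFKD), then drop the combining marks.
    # Note: ß has no decomposition, so it is dropped rather than mapped to 'ss'.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

with open('contacts.csv', encoding='utf-8') as f:  # Python 3: decode at read time
    for line in f:
        print(to_ascii(line), end='')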
matching data packets and ICMP packets in case of TCP duplicates
I'm trying to match data packets with the ICMP time-exceeded packets they triggered. Therefore, I'm comparing 28-byte-long strings of each data packet (the IP header plus 8 bytes of payload) with all (28-byte-long) ICMP payloads. I'm having problems when I'm sending duplicate TCP packets:

>>> p1
<IP version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCP sport=10743 dport=37901 seq=2939035442L ack=2703569003L dataofs=10L reserved=0L flags=SA window=14480 chksum=0x9529 urgptr=0 options=[('MSS', 1460), ('SAckOK', ''), ('Timestamp', (215365485, 52950)), ('NOP', None), ('WScale', 4)] |>>
>>> p2
<IP version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCP sport=10743 dport=37901 seq=2939035442L ack=2703569003L dataofs=10L reserved=0L flags=SA window=14480 chksum=0x9426 urgptr=0 options=[('MSS', 1460), ('SAckOK', ''), ('Timestamp', (215365744, 52950)), ('NOP', None), ('WScale', 4)] |>>

...whose first 28 bytes are the same, but which differ in the rest of the TCP header:

'E\x00\x00<\x00\x00@\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'
'E\x00\x00<\x00\x00@\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'

The ICMP packets I got thus have the same payload:

>>> i1[ICMP]
<ICMP type=time-exceeded code=ttl-zero-during-transit chksum=0x689a unused=0 |<IPerror version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCPerror sport=10743 dport=37901 seq=2939035442L |>>>
>>> i2[ICMP]
<ICMP type=time-exceeded code=ttl-zero-during-transit chksum=0x689a unused=0 |<IPerror version=4L ihl=5L tos=0x0 len=60 id=0 flags=DF frag=0L ttl=1 proto=tcp chksum=0x7093 src=XXX dst=YYY options=[] |<TCPerror sport=10743 dport=37901 seq=2939035442L |>>>

The corresponding strings are:

'E\x00\x00<\x00\x00@\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'
'E\x00\x00<\x00\x00@\x00\x01\x06p\x93\x8a`t\x86\xb2.X\x14)\xf7\x94\r\xaf.\x1f2'

Right now, in this particular case, I'm claiming that p1 matches i1 because, between i1 and i2, it is i1 that arrived soon after the sending of p1, whereas i2 arrived much later. Is this enough? What else am I missing?
The header size of a TCP packet is not always 20 bytes; if options are set, the header can be larger. You can use the Internet Header Length (IHL) field to find the IP header size and add the amount of payload you want to that number. See also: Scapy: how do I get the full IP packet header?
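As a sketch of that calculation on a raw IPv4 packet (pkt is a hypothetical bytes object here, e.g. what bytes(p) gives you from scapy on Python 3; nothing below comes from the original answer):

def icmp_quote_key(pkt):
    """Return the part of a raw IPv4 packet that an ICMP time-exceeded
    message quotes back: the IP header plus the first 8 payload bytes."""
    ihl = (pkt[0] & 0x0f) * 4  # IHL is the low nibble of byte 0, in 32-bit words
    return pkt[:ihl + 8]       # on Python 2, use ord(pkt[0]) instead of pkt[0]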