I have a column in my csv file (the second column, called character_position) which represents a list of characters and their positions, as follows.
Each line of this column contains one such list of character positions; overall I have 300 lines, each with its own list.
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
Each character is followed by four values: left, top, right, bottom. For instance, character '1' has left=1890, top=1904, right=486, bottom=505.
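To make the layout concrete, here is a minimal sketch of how one such cell can be unpacked into (char, left, top, right, bottom) tuples, assuming each cell is a valid Python list literal:
import ast

cell = "[['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507]]"
groups = ast.literal_eval(cell)  # nested lists, one per sub-list
for group in groups:
    # slice the flat list into blocks of five values
    records = [tuple(group[i:i + 5]) for i in range(0, len(group), 5)]
    print(records)
# [('1', 1890, 1904, 486, 505), ('8', 1905, 1916, 486, 507)]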
My whole csv file is read as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position'])
From this file I created a new csv file with five columns:
column 1: character, column 2: left, column 3: top, column 4: right, column 5: bottom.
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
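The columns % 5 / columns // 5 pair above builds a MultiIndex on the columns so that stack folds every block of five columns into its own row; a small demo of the idea:
import pandas as pd

# toy frame: two characters flattened into ten columns
demo = pd.DataFrame([["'m'", 38, 104, 2456, 2492, "'i'", 40, 102, 2442, 2448]])
# level 0: position within a block of five; level 1: which block
demo.columns = [demo.columns % 5, demo.columns // 5]
demo = demo.stack().reset_index(drop=True)
demo.columns = ['char', 'left', 'top', 'right', 'bottom']
print(demo)
# yields two rows: ('m', 38, 104, 2456, 2492) and ('i', 40, 102, 2442, 2448)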
I want to add 2 other columns, called line_number and all_chars_in_same_row:
1) line_number corresponds to the csv line from which, for example, 'm' 38 104 2456 2492 was extracted, say line 2.
2) all_chars_in_same_row corresponds to all the (space-separated) characters which are in the same row. For instance, from
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
I get '1' '8' '4' '1' '7' and so on.
More formally:
all_chars_in_same_row means: write all the characters of the given row into the all_chars_in_same_row column.
char left top right bottom line_number all_chars_in_same_row
0 'm' 38 104 2456 2492 from line 2 'm' '2' '5' 'g'
1 'i' 40 102 2442 222 from line 4
2 '.' 203 213 191 198 from line 6
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
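In other words, a sketch of the rule I am after (reusing the reshape from above, and assuming no empty [] groups; those are dealt with in EDIT2 below):
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack()
df1.columns = ['char', 'left', 'top', 'right', 'bottom']
# the first index level is the csv line the record came from
df1['line_number'] = df1.index.get_level_values(0)
df1 = df1.reset_index(drop=True)
# space-separated characters of every record sharing the same csv line
df1['all_chars_in_same_row'] = (
    df1.groupby('line_number')['char'].transform(lambda s: ' '.join(s)))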
EDIT1:
import pandas as pd
df_data=pd.read_csv('/home/ahmed/internship/cnn_ocr/list_characters.csv')
df_data.shape
(50, 3)
df_data.iloc[:, 1]
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11 [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12 [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13 [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14 [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15 [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16 [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17 [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18 [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19 [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20 [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21 [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22 [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23 [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24 [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25 [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26 [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27 [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28 [['S', 2369, 2382, 1833, 1866]]
29 [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
30 [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
31 [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
32 [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
33 [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
34 [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
35 [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
36 [['c', 88, 118, 2872, 2902]]
37 [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
38 [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
39 [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
40 [[]]
41 [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
42 [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
43 [[]]
44 [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
45 [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
46 [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
47 [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
48 [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
49 [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
Name: character_position, dtype: object
Then, to build my chars.csv, I do the following:
df = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any leftover brackets from every column in one pass
df1 = df1.replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('chars.csv')
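As an aside, a sketch that avoids the bracket-stripping regexes altogether by parsing each cell as a Python literal (assuming every cell is valid list syntax and the file has no header row):
import ast
import pandas as pd

df = pd.read_csv('list_characters.csv', header=None, usecols=[1],
                 names=['character_position'])
parsed = df.character_position.apply(ast.literal_eval)  # real nested lists
rows = [group[i:i + 5]            # blocks of five values
        for groups in parsed      # each csv line
        for group in groups       # each sub-list of that line
        for i in range(0, len(group), 5)]
df1 = pd.DataFrame(rows, columns=['char', 'left', 'right', 'top', 'bottom'])
df1.to_csv('chars.csv')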
However, I don't see in your response how you added the columns FromLine and all_chars_in_same_row.
When I execute your line of code:
df_data = df_data.character_position.str.strip('[]').str.split(',', expand=True)
I get the following:
df_data[0:10]
0 1 2 3 4 5 6 7 8 9 ... \
0 'm' 38 104 2456 2492 'i' 40 102 2442 2448 ...
1 '.' 203 213 191 198 '3' 235 262 131 198 ...
2 'A' 275 347 147 239 'M' 363 465 145 239 ...
3 'A' 73 91 373 394 'D' 93 112 373 396 ...
4 'D' 454 473 663 685 'O' 474 495 664 687 ...
5 'A' 108 129 727 751 'V' 129 150 727 753 ...
6 'N' 34 51 949 970 '/' 52 61 948 970 ...
7 'S' 1368 1401 43 85 'A' 1406 1446 43 85 ...
8 'S' 1437 1457 112 138 'o' 1458 1476 118 138 ...
9 'h' 1686 1703 315 339 't' 1706 1715 316 339 ...
1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
0 None None None None None None None None None None
1 None None None None None None None None None None
2 None None None None None None None None None None
3 None None None None None None None None None None
4 None None None None None None None None None None
5 None None None None None None None None None None
6 None None None None None None None None None None
Here are the first 10 lines of my csv file:
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 687, 'C', 498, 516, 664, 687, 'U', 519, 536, 663, 687, 'M', 540, 561, 663, 687, 'E', 564, 581, 663, 685, 'N', 584, 600, 664, 685, 'T', 603, 618, 663, 685]]
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 753, 'O', 153, 175, 727, 753, 'I', 178, 183, 727, 751, 'R', 187, 210, 727, 751, 'S', 220, 240, 727, 753, 'U', 243, 263, 727, 753, 'R', 267, 288, 727, 751, 'F', 302, 318, 727, 751, 'A', 320, 341, 727, 751, 'C', 342, 363, 726, 751, 'T', 366, 384, 726, 750, 'U', 387, 407, 727, 751, 'R', 411, 432, 727, 751, 'E', 435, 453, 726, 751, 'P', 797, 815, 727, 751, 'A', 818, 839, 727, 751, 'G', 840, 863, 727, 751, 'E', 867, 885, 726, 751, '1', 900, 911, 727, 751, '1', 926, 934, 727, 751, '1', 947, 956, 727, 751, '5', 962, 979, 727, 751], ['R', 120, 142, 778, 807, 'T', 144, 165, 778, 805, 'T', 178, 199, 778, 805, 'e', 201, 219, 786, 807, 'c', 222, 240, 786, 807, 'h', 241, 258, 778, 807, 'n', 263, 279, 786, 807, 'i', 284, 287, 778, 805, 'c', 291, 308, 786, 807, 'a', 309, 327, 786, 807, 'R', 350, 374, 778, 807, 'e', 377, 395, 786, 807, 't', 396, 405, 780, 805, 'u', 408, 425, 786, 807, 'r', 429, 440, 786, 807, 'n', 441, 458, 786, 807, '-', 471, 482, 793, 798, 'D', 497, 518, 778, 807, 'O', 522, 548, 777, 807, 'A', 549, 573, 778, 807, '/', 585, 596, 778, 807, 'D', 606, 630, 778, 807, 'A', 632, 656, 778, 807, 'P', 659, 680, 778, 805]]
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970, 'C', 63, 81, 948, 970, 'O', 84, 103, 948, 970, 'M', 106, 127, 949, 970, 'M', 130, 151, 948, 970, 'A', 153, 172, 949, 970, 'N', 175, 192, 949, 970, 'D', 195, 213, 948, 970, 'E', 217, 232, 948, 970], ['1', 73, 84, 993, 1020, '1', 94, 105, 993, 1020, '8', 112, 130, 991, 1020, '4', 135, 153, 993, 1018, '5', 156, 172, 994, 1018, '7', 175, 192, 993, 1018, '6', 195, 213, 993, 1020, '0', 216, 235, 991, 1020, '6', 238, 257, 993, 1020, '5', 260, 278, 993, 1020, '0', 407, 425, 991, 1020, '9', 428, 446, 991, 1020, '.', 450, 455, 1015, 1020, '0', 459, 477, 991, 1020, '1', 485, 494, 994, 1018, '.', 503, 507, 1015, 1020, '2', 512, 530, 991, 1020, '0', 533, 551, 991, 1020, '1', 555, 566, 993, 1020, '5', 575, 593, 993, 1020, 'R', 632, 656, 991, 1020, 'M', 659, 684, 991, 1020, 'A', 689, 713, 991, 1020, 'N', 726, 747, 993, 1020, 'o', 752, 770, 999, 1020, '.', 774, 779, 1015, 1020, '5', 794, 812, 993, 1020, '8', 815, 833, 991, 1020, '4', 834, 852, 993, 1017, '4', 857, 873, 994, 1018, '3', 878, 896, 991, 1020, '8', 899, 917, 991, 1020, '0', 920, 938, 991, 1020, '/', 950, 960, 991, 1020, '0', 971, 990, 993, 1020, '7', 995, 1011, 993, 1018, '1', 1016, 1026, 993, 1018, '6', 1034, 1052, 993, 1020, '7', 1055, 1073, 993, 1020, '4', 1076, 1094, 993, 1018, '8', 1098, 1116, 991, 1020, '9', 1119, 1137, 991, 1020, '0', 1140, 1158, 993, 1020, '9', 1160, 1178, 991, 1020], ['N', 34, 51, 1045, 1066, '/', 54, 61, 1045, 1066, 'B', 63, 79, 1044, 1066, 'O', 82, 102, 1044, 1066, 'N', 105, 121, 1045, 1066, 'D', 133, 151, 1045, 1066, 'E', 156, 172, 1044, 1066, 'L', 183, 196, 1045, 1066, 'I', 199, 204, 1045, 1066, 'V', 205, 223, 1045, 1066, 'R', 226, 244, 1045, 1066, 'A', 246, 266, 1045, 1066, 'I', 267, 272, 1045, 1066, 'S', 275, 291, 1044, 1066, 'O', 294, 314, 1045, 1066, 'N', 318, 335, 1045, 1066], ['8', 72, 90, 1093, 1122, '2', 93, 109, 1093, 1122, '5', 114, 132, 1095, 1122, '9', 135, 153, 1093, 1122, '7', 154, 172, 1095, 1122, '1', 178, 189, 1093, 1122, '3', 196, 214, 1093, 1122, '1', 220, 231, 1095, 1122, '0', 238, 257, 1093, 1122, '3', 260, 278, 1093, 1122, '0', 407, 425, 1093, 1122, '6', 429, 447, 1095, 1122, '.', 452, 455, 1117, 1122, '0', 459, 477, 1093, 1122, '2', 480, 498, 1093, 1122, '.', 503, 507, 1117, 1122, '2', 512, 530, 1093, 1122, '0', 533, 551, 1093, 1122, '1', 557, 567, 1095, 1122, '5', 575, 593, 1095, 1122], ['v', 70, 90, 1150, 1171, '/', 88, 97, 1150, 1171, 'r', 100, 118, 1150, 1171, 'é', 121, 136, 1144, 1173, 'f', 141, 156, 1150, 1171, 'ê', 159, 174, 1144, 1173, 'r', 177, 195, 1150, 1173, 'e', 198, 214, 1150, 1171, 'n', 217, 234, 1150, 1171, 'c', 238, 257, 1149, 1171, 'e', 260, 276, 1149, 1173, 'B', 476, 497, 1152, 1179, 'O', 501, 527, 1149, 1179, 'G', 530, 555, 1150, 1180, 'D', 560, 582, 1152, 1179, 'O', 585, 611, 1149, 1179, 'A', 614, 638, 1150, 1179, '1', 642, 653, 1152, 1179, '5', 659, 677, 1153, 1180, 'B', 681, 701, 1152, 1179, 'T', 705, 726, 1152, 1179, '0', 728, 746, 1152, 1179, '6', 749, 767, 1152, 1179]]
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43, 85, 'M', 1451, 1491, 36, 85, 'S', 1500, 1533, 43, 85, 'U', 1539, 1574, 43, 85, 'N', 1581, 1616, 43, 85, 'G', 1623, 1662, 42, 85, 'E', 1686, 1719, 43, 85, 'L', 1725, 1755, 43, 85, 'E', 1763, 1794, 42, 85, 'C', 1800, 1836, 43, 85, 'T', 1841, 1874, 42, 85, 'R', 1880, 1914, 42, 84, 'O', 1919, 1959, 42, 85, 'N', 1965, 1998, 42, 84, 'I', 2007, 2016, 42, 84, 'C', 2022, 2058, 42, 84, 'S', 2066, 2099, 42, 84, 'F', 2121, 2151, 42, 84, 'R', 2159, 2193, 42, 84, 'A', 2198, 2237, 40, 84, 'N', 2243, 2277, 40, 84, 'C', 2285, 2321, 42, 84, 'E', 2328, 2360, 40, 84]]
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, 118, 138, 'c', 1479, 1493, 120, 138, 'i', 1494, 1499, 112, 136, 'é', 1503, 1518, 114, 138, 't', 1520, 1527, 115, 138, 'é', 1530, 1547, 112, 138, 'p', 1559, 1575, 120, 144, 'a', 1577, 1593, 118, 138, 'r', 1596, 1607, 118, 136, 'A', 1616, 1637, 112, 136, 'c', 1640, 1653, 118, 138, 't', 1655, 1664, 115, 136, 'i', 1665, 1670, 112, 136, 'o', 1673, 1688, 118, 138, 'n', 1692, 1707, 118, 136, 's', 1710, 1725, 118, 138, 'S', 1736, 1755, 112, 138, 'i', 1760, 1763, 112, 136, 'm', 1767, 1791, 118, 136, 'p', 1794, 1811, 118, 142, 'l', 1812, 1817, 112, 136, 'i', 1821, 1824, 112, 136, 'f', 1827, 1835, 112, 136, 'i', 1835, 1841, 112, 136, 'é', 1845, 1860, 112, 136, 'e', 1863, 1878, 118, 136, 'a', 1890, 1907, 118, 138, 'u', 1910, 1925, 118, 136, 'C', 1937, 1958, 112, 136, 'a', 1961, 1977, 118, 136, 'p', 1980, 1995, 118, 142, 'i', 1998, 2003, 112, 136, 't', 2006, 2013, 114, 136, 'a', 2015, 2030, 118, 136, 'l', 2034, 2037, 112, 136, 'd', 2051, 2066, 111, 136, 'e', 2069, 2085, 117, 136, '2', 2097, 2112, 112, 136, '7', 2115, 2132, 111, 136, '.', 2136, 2139, 132, 136, '0', 2144, 2159, 111, 136, '0', 2162, 2178, 111, 136, '0', 2180, 2196, 111, 136, '.', 2201, 2205, 132, 135, '0', 2208, 2225, 111, 136, '0', 2228, 2243, 111, 136, '0', 2246, 2261, 111, 136, 't', 2273, 2281, 112, 135, 'i', 2281, 2291, 111, 136], ['1', 1473, 1482, 153, 177, ',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 177, 'u', 1520, 1535, 160, 177, 'e', 1538, 1554, 159, 177, 'F', 1566, 1583, 153, 177, 'r', 1587, 1596, 159, 177, 'u', 1598, 1613, 159, 177, 'c', 1617, 1631, 159, 177, 't', 1634, 1641, 154, 177, 'i', 1643, 1646, 153, 177, 'd', 1650, 1665, 151, 177, 'o', 1668, 1685, 159, 177, 'r', 1688, 1697, 159, 177, 'C', 1709, 1730, 153, 177, 'S', 1733, 1751, 153, 177, '2', 1764, 1779, 153, 177, '0', 1781, 1797, 153, 177, '0', 1800, 1817, 153, 177, '3', 1820, 1835, 151, 177, '9', 1847, 1863, 151, 177, '3', 1866, 1883, 151, 177, '4', 1883, 1901, 153, 175, '8', 1904, 1919, 151, 177, '4', 1919, 1937, 153, 175, 'S', 1950, 1968, 151, 177, 'A', 1971, 1992, 151, 175, 'I', 1995, 2000, 151, 175, 'N', 2004, 2024, 151, 175, 'T', 2027, 2046, 151, 175, 'O', 2058, 2081, 151, 177, 'U', 2085, 2105, 151, 177, 'E', 2109, 2127, 151, 177, 'N', 2130, 2150, 151, 175, 'C', 2163, 2186, 151, 175, 'e', 2187, 2204, 157, 175, 'd', 2207, 2222, 150, 175, 'e', 2225, 2240, 157, 175, 'x', 2243, 2258, 157, 175], ['T', 1638, 1656, 192, 216, 'É', 1659, 1677, 186, 217, 'L', 1682, 1697, 193, 217, 'É', 1701, 1719, 187, 217, 'P', 1722, 1742, 192, 217, 'H', 1746, 1766, 193, 217, 'O', 1770, 1793, 192, 217, 'N', 1796, 1815, 192, 216, 'E', 1820, 1838, 192, 217, '0', 1869, 1886, 190, 216, '1', 1890, 1899, 192, 216, '4', 1914, 1931, 193, 216, '4', 1934, 1950, 193, 216, '0', 1961, 1977, 190, 216, '4', 1980, 1997, 193, 216, '7', 2009, 2024, 192, 216, '0', 2027, 2042, 192, 216, '0', 2055, 2070, 192, 216, '0', 2073, 2090, 192, 216], ['R', 1517, 1538, 232, 258, '.', 1542, 1545, 253, 256, 'C', 1550, 1571, 232, 256, '.', 1575, 1580, 252, 256, 'S', 1584, 1602, 232, 256, '.', 1607, 1611, 252, 256, 'B', 1625, 1643, 232, 256, 'O', 1649, 1670, 231, 258, 'B', 1674, 1692, 232, 256, 'I', 1697, 1701, 232, 256, 'G', 1706, 1728, 232, 256, 'N', 1731, 1751, 232, 256, 'Y', 1754, 1775, 232, 256, 'B', 1788, 1806, 232, 256, '3', 1818, 1835, 231, 256, '3', 1838, 1855, 231, 256, '4', 1855, 1872, 232, 255, '3', 1884, 1899, 232, 256, '6', 1904, 1919, 232, 256, '7', 1922, 1937, 232, 256, '4', 1947, 1964, 232, 256, '9', 1967, 1983, 232, 256, '7', 1986, 2001, 232, 256, '-', 
2013, 2022, 244, 249, 'A', 2034, 2055, 231, 255, 'P', 2057, 2075, 231, 255, 'E', 2079, 2097, 231, 256, '4', 2109, 2126, 232, 255, '6', 2129, 2145, 232, 256, '5', 2148, 2163, 232, 256, '2', 2166, 2183, 232, 255, 'Z', 2193, 2211, 231, 255], ['C', 1628, 1647, 271, 297, 'o', 1652, 1670, 279, 297, 'd', 1671, 1689, 273, 297, 'e', 1692, 1709, 279, 298, 'T', 1721, 1739, 273, 297, 'V', 1742, 1763, 273, 297, 'A', 1763, 1787, 273, 297, 'F', 1818, 1835, 273, 297, 'R', 1839, 1859, 273, 297, '8', 1872, 1889, 273, 297, '9', 1890, 1905, 273, 297, '3', 1919, 1932, 273, 297, '3', 1937, 1952, 273, 297, '4', 1953, 1971, 273, 297, '3', 1983, 1998, 273, 297, '6', 2001, 2018, 273, 297, '7', 2021, 2036, 273, 295, '4', 2048, 2064, 274, 297, '9', 2066, 2082, 273, 297, '7', 2085, 2100, 273, 295]]
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, 316, 339, 't', 1718, 1727, 316, 339, 'p', 1730, 1748, 321, 345, 'i', 1751, 1757, 321, 339, 'f', 1760, 1769, 315, 339, '/', 1769, 1776, 313, 339, 'w', 1779, 1804, 321, 337, 'w', 1804, 1829, 321, 339, 'w', 1830, 1854, 321, 337, '.', 1859, 1863, 333, 337, 's', 1868, 1883, 319, 339, 'a', 1886, 1901, 321, 337, 'm', 1905, 1929, 321, 337, 's', 1932, 1949, 321, 339, 'u', 1953, 1968, 321, 339, 'n', 1973, 1989, 321, 339, 'g', 1992, 2010, 319, 345, '.', 2015, 2019, 333, 337, 'f', 2021, 2033, 313, 337, 'r', 2034, 2045, 319, 337]]
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, 370, 393, 'O', 1382, 1403, 370, 393, 'M', 1404, 1425, 370, 391, 'P', 1430, 1446, 370, 391, 'T', 1448, 1464, 370, 391, 'E', 1467, 1484, 370, 393, 'C', 1494, 1512, 370, 393, 'L', 1515, 1532, 370, 393, 'I', 1533, 1539, 370, 393, 'E', 1542, 1559, 370, 393, 'N', 1560, 1580, 370, 393, 'T', 1580, 1598, 370, 393]]
Here is the second csv file:
char left right top bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 2448
2 'i' 40 100 2402 2410
3 'l' 40 102 2372 2382
4 'm' 40 102 2312 2358
5 'u' 40 102 2292 2310
6 'i' 40 104 2210 2260
7 'l' 40 104 2180 2208
8 'i' 40 104 2140 2166
EDIT1
Here is my output for Solution 2 (with the `character_position` input described above):
1831 1830 level_2 char left top right bottom FromLine all_chars_in_same_row
0 0 character_position 0 character_position 0 character_position
1 1 'm','i','i','l','m','u','i','l','i','l' 0 'm' 38 104 2456 2492 1 'm','i','i','l','m','u','i','l','i','l'
2 1 'm','i','i','l','m','u','i','l','i','l' 1 'i' 40 102 2442 2448 1 'm','i','i','l','m','u','i','l','i','l'
3 1 'm','i','i','l','m','u','i','l','i','l' 2 'i' 40 100 2402 2410 1 'm','i','i','l','m','u','i','l','i','l'
I think the problem comes from the fact that I have in my data:
[[',' , 'A', ',' , '.', ':' , ';', '1'], [], ['m', 'a',]]
so the empty `[]` lists break the column order. I noticed this when I tried to omit all the empty `[]`, because I found values in my csv such as `['a` rather than `'a'` in the char column, `8794]` rather than `8794`, or `[5345` rather than `5345`.
So I processed the csv as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1,3], names=['character_position','LineIndex'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any leftover brackets from every column in one pass
df1 = df1.replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('char.csv')
Then I noticed the following: at line 1221, column B is empty (it replaced an empty `[]`), and the columns get switched (B and C) because of the empty char. How can I solve that?
I also have empty lines:
3831 '6' 296 314 3204 3231
3832
3833 '1' 480 492 3202 3229
Line 3832 should be removed, in order to get a clean table like the output further below.
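A way to drop such fully blank rows (a sketch, assuming the blanks survive as empty strings after the bracket stripping):
import numpy as np

# turn empty strings into NaN, then drop rows that are entirely blank
df1 = df1.replace(r'^\s*$', np.nan, regex=True)
df1 = df1.dropna(how='all')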
EDIT2:
In order to solve the problem of empty rows and empty `[]` in list_characters.csv, such as
[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
and
[[]] [[]]
I did the following:
import numpy as np

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace([r'\[\],', r'\[\[\]\]', ''], ['', '', np.nan], regex=True)
df1 = df1.dropna()
Then:
import ast

df = pd.read_csv('character_position.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
page_number positionlrtb \
0 1841729699_001 [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...
1 1841729699_001 [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]
2 1841729699_001 [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...
3 1841729699_001 [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...
4 1841729699_001 [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...
LineIndex
0 [[mi, il, mu, il, il]]
1 [[.3]]
2 [[amsun]]
3 [[adresse, de, livraison]]
4 [[document]]
import numpy as np
from itertools import chain

cols = ['char','left','top','right','bottom']
df1 = pd.DataFrame({
    "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
    "b": list(chain.from_iterable(df.positionlrtb))})
df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print(df1)
char left top right bottom
0 m 38 104 2456 2492
1 i 40 102 2442 2448
2 i 40 100 2402 2410
3 l 40 102 2372 2382
4 m 40 102 2312 2358
5 u 40 102 2292 2310
6 i 40 104 2210 2260
7 l 40 104 2180 2208
8 i 40 104 2140 2166
However:
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
returns None values (with expand=True, shorter rows are padded with None up to the length of the longest row).
Once you have created the required data frame, after stacking, don't drop the index: it holds your line number. Since this is a MultiIndex, get the first level - that is your line number.
df_data['LineIndex'] = df_data.index.get_level_values(0)
Then you can group by the LineIndex column and get all the characters for a common LineIndex. This gives a dictionary; convert this dictionary into a data frame and finally merge it into the actual data.
Solution 1
import pandas as pd
df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()  # don't reset the index, it holds the line each record came from
print(df_data)
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign line number to a column
cols = ['char','left','top','right','bottom','FromLine']
df_data.columns = cols  # assign the new column names
# create a new dictionary:
# it maps each line number (key) to all the characters from that line (value)
DictChar = {k: list(v) for k, v in df_data.groupby("FromLine")["char"]}
# convert the dictionary into a dataframe
df_chars = pd.DataFrame(list(DictChar.items()))
df_chars.columns = ['FromLine', 'char']
# merge the dataframes on column 'FromLine'
df_final = df_data.merge(df_chars, on='FromLine')
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_final.columns = cols
print(df_final)
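Note that the dictionary-and-merge step can also be collapsed into a single groupby().transform, which broadcasts the grouped result straight back onto the frame (a compact variant of the same idea; here the characters come back as one comma-joined string instead of a list):
# equivalent to the dictionary + merge above, in one step
df_data['all_chars_in_same_row'] = (
    df_data.groupby('FromLine')['char'].transform(lambda s: ', '.join(s)))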
Solution 2
I personally prefer this solution over the first one. See the inline comments for more details.
import pandas as pd
df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
x = len(df_data.columns)  # get the total number of columns
# get the character from every 5th column, concatenate them, and create a new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda row: ','.join(row.dropna()), axis=1)
# the index of each row is the line number of your record
df_data[x+1] = df_data.index.get_level_values(0)
# now set the line number and the concatenated characters as the index of the data frame
df_data.set_index([x+1, x], inplace=True, drop=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign the line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1)  # assign the concatenated characters to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns = cols
df_data.reset_index(drop=True, inplace=True)  # remove the multiindexing
print(df_data[cols])
Output
char left top right bottom FromLine all_chars_in_same_row
0 '.' 203 213 191 198 0 ['.', '3', 'C']
1 '3' 1758 1775 370 391 0 ['.', '3', 'C']
2 'C' 296 305 1492 1516 0 ['.', '3', 'C']
3 'A' 275 347 147 239 1 ['A', 'M', 'D']
4 'M' 2166 2184 370 391 1 ['A', 'M', 'D']
5 'D' 339 362 1815 1840 1 ['A', 'M', 'D']
6 'A' 73 91 373 394 2 ['A', 'D', 'A']
7 'D' 1395 1415 427 454 2 ['A', 'D', 'A']
8 'A' 1440 1455 2047 2073 2 ['A', 'D', 'A']
9 'D' 454 473 663 685 3 ['D', 'O', '0']
10 'O' 1533 1545 487 541 3 ['D', 'O', '0']
11 '0' 339 360 2137 2163 3 ['D', 'O', '0']
12 'A' 108 129 727 751 4 ['A', 'V', 'I']
13 'V' 1659 1677 490 514 4 ['A', 'V', 'I']
14 'I' 339 360 1860 1885 4 ['A', 'V', 'I']
15 'N' 34 51 949 970 5 ['N', '/', '2']
16 '/' 1890 1904 486 505 5 ['N', '/', '2']
17 '2' 1266 1283 1951 1977 5 ['N', '/', '2']
18 'S' 1368 1401 43 85 6 ['S', 'A', '8']
19 'A' 1344 1361 583 607 6 ['S', 'A', '8']
20 '8' 2207 2217 1492 1515 6 ['S', 'A', '8']
21 'S' 1437 1457 112 138 7 ['S', 'o', 'O']
22 'o' 1548 1580 979 1015 7 ['S', 'o', 'O']
23 'O' 1331 1349 370 391 7 ['S', 'o', 'O']
24 'h' 1686 1703 315 339 8 ['h', 't', 't']
25 't' 169 190 1291 1312 8 ['h', 't', 't']
26 't' 169 190 1291 1312 8 ['h', 't', 't']
27 'N' 1331 1349 370 391 9 ['N', 'C', 'C']
28 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
29 'C' 296 305 1492 1516 9 ['N', 'C', 'C']