Get the row index of each extracted character from csv file - python

I have a column (the second column, called character_position) in my csv file which represents a list of characters and their positions, as follows:
Each line of this column contains a list of character positions. Overall I have 300 lines in this column, each with a list of character positions:
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
Each character has four values: left, top, right, bottom. For instance, the character '1' has left=1890, top=1904, right=486, bottom=505.
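For clarity, a minimal sketch (assuming the cell is parsed with ast.literal_eval; the shortened example cell is illustrative) of how the flat list groups into (char, left, top, right, bottom) records:
import ast

cell = "[['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507]]"  # shortened example cell
for sublist in ast.literal_eval(cell):  # one sublist per detected text line
    # every 5 consecutive items form one record: char, left, top, right, bottom
    records = [tuple(sublist[i:i + 5]) for i in range(0, len(sublist), 5)]
    print(records)  # [('1', 1890, 1904, 486, 505), ('8', 1905, 1916, 486, 507)]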
My whole csv file is read as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position'])
From this file I created a new csv file with five columns:
column 1: character, column 2: left, column 3: top, column 4: right, column 5: bottom.
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
I want to add two other columns called line_number and all_chars_in_same_row:
1) line_number corresponds to the line from which, for example, 'm' 38 104 2456 2492 was extracted; let's say line 2.
2) all_chars_in_same_row corresponds to all the (space-separated) characters which are in the same row. For instance, given
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
I get '1' '8' '4' '1' '7' and so on.
More formally:
all_chars_in_same_row means: write all the characters of the given row into the all_chars_in_same_row column (a sketch follows the expected output below).
char left top right bottom line_number all_chars_in_same_row
0 'm' 38 104 2456 2492 from line 2 'm' '2' '5' 'g'
1 'i' 40 102 2442 222 from line 4
2 '.' 203 213 191 198 from line 6
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
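As a sketch of the goal (line_number is hypothetical here, since producing it is exactly what is asked), the second column could then be derived with a groupby:
df1['all_chars_in_same_row'] = df1.groupby('line_number')['char'].transform(' '.join)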
EDIT1:
import pandas as pd
df_data=pd.read_csv('/home/ahmed/internship/cnn_ocr/list_characters.csv')
df_data.shape
(50, 3)
df_data.iloc[:, 1]
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11 [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12 [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13 [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14 [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15 [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16 [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17 [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18 [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19 [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20 [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21 [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22 [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23 [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24 [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25 [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26 [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27 [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28 [['S', 2369, 2382, 1833, 1866]]
29 [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
30 [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
31 [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
32 [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
33 [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
34 [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
35 [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
36 [['c', 88, 118, 2872, 2902]]
37 [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
38 [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
39 [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
40 [[]]
41 [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
42 [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
43 [[]]
44 [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
45 [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
46 [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
47 [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
48 [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
49 [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
Name: character_position, dtype: object
Then, to build my chars.csv, I do the following:
df = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df = df.replace(['\[','\]'], ['',''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left']=df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top']=df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right']=df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom']=df1['bottom'].replace(['\[','\]'], ['',''], regex=True)
df1.to_csv('chars.csv')
However, I don't see in your response how you added the columns FromLine and all_chars_in_same_row.
When I execute your line of code:
df_data = df_data.character_position.str.strip('[]').str.split(',', expand=True)
I get the following:
df_data[0:10]
0 1 2 3 4 5 6 7 8 9 ... \
0 'm' 38 104 2456 2492 'i' 40 102 2442 2448 ...
1 '.' 203 213 191 198 '3' 235 262 131 198 ...
2 'A' 275 347 147 239 'M' 363 465 145 239 ...
3 'A' 73 91 373 394 'D' 93 112 373 396 ...
4 'D' 454 473 663 685 'O' 474 495 664 687 ...
5 'A' 108 129 727 751 'V' 129 150 727 753 ...
6 'N' 34 51 949 970 '/' 52 61 948 970 ...
7 'S' 1368 1401 43 85 'A' 1406 1446 43 85 ...
8 'S' 1437 1457 112 138 'o' 1458 1476 118 138 ...
9 'h' 1686 1703 315 339 't' 1706 1715 316 339 ...
1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
0 None None None None None None None None None None
1 None None None None None None None None None None
2 None None None None None None None None None None
3 None None None None None None None None None None
4 None None None None None None None None None None
5 None None None None None None None None None None
6 None None None None None None None None None None
Here are the first 10 lines of my csv file:
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 687, 'C', 498, 516, 664, 687, 'U', 519, 536, 663, 687, 'M', 540, 561, 663, 687, 'E', 564, 581, 663, 685, 'N', 584, 600, 664, 685, 'T', 603, 618, 663, 685]]
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 753, 'O', 153, 175, 727, 753, 'I', 178, 183, 727, 751, 'R', 187, 210, 727, 751, 'S', 220, 240, 727, 753, 'U', 243, 263, 727, 753, 'R', 267, 288, 727, 751, 'F', 302, 318, 727, 751, 'A', 320, 341, 727, 751, 'C', 342, 363, 726, 751, 'T', 366, 384, 726, 750, 'U', 387, 407, 727, 751, 'R', 411, 432, 727, 751, 'E', 435, 453, 726, 751, 'P', 797, 815, 727, 751, 'A', 818, 839, 727, 751, 'G', 840, 863, 727, 751, 'E', 867, 885, 726, 751, '1', 900, 911, 727, 751, '1', 926, 934, 727, 751, '1', 947, 956, 727, 751, '5', 962, 979, 727, 751], ['R', 120, 142, 778, 807, 'T', 144, 165, 778, 805, 'T', 178, 199, 778, 805, 'e', 201, 219, 786, 807, 'c', 222, 240, 786, 807, 'h', 241, 258, 778, 807, 'n', 263, 279, 786, 807, 'i', 284, 287, 778, 805, 'c', 291, 308, 786, 807, 'a', 309, 327, 786, 807, 'R', 350, 374, 778, 807, 'e', 377, 395, 786, 807, 't', 396, 405, 780, 805, 'u', 408, 425, 786, 807, 'r', 429, 440, 786, 807, 'n', 441, 458, 786, 807, '-', 471, 482, 793, 798, 'D', 497, 518, 778, 807, 'O', 522, 548, 777, 807, 'A', 549, 573, 778, 807, '/', 585, 596, 778, 807, 'D', 606, 630, 778, 807, 'A', 632, 656, 778, 807, 'P', 659, 680, 778, 805]]
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970, 'C', 63, 81, 948, 970, 'O', 84, 103, 948, 970, 'M', 106, 127, 949, 970, 'M', 130, 151, 948, 970, 'A', 153, 172, 949, 970, 'N', 175, 192, 949, 970, 'D', 195, 213, 948, 970, 'E', 217, 232, 948, 970], ['1', 73, 84, 993, 1020, '1', 94, 105, 993, 1020, '8', 112, 130, 991, 1020, '4', 135, 153, 993, 1018, '5', 156, 172, 994, 1018, '7', 175, 192, 993, 1018, '6', 195, 213, 993, 1020, '0', 216, 235, 991, 1020, '6', 238, 257, 993, 1020, '5', 260, 278, 993, 1020, '0', 407, 425, 991, 1020, '9', 428, 446, 991, 1020, '.', 450, 455, 1015, 1020, '0', 459, 477, 991, 1020, '1', 485, 494, 994, 1018, '.', 503, 507, 1015, 1020, '2', 512, 530, 991, 1020, '0', 533, 551, 991, 1020, '1', 555, 566, 993, 1020, '5', 575, 593, 993, 1020, 'R', 632, 656, 991, 1020, 'M', 659, 684, 991, 1020, 'A', 689, 713, 991, 1020, 'N', 726, 747, 993, 1020, 'o', 752, 770, 999, 1020, '.', 774, 779, 1015, 1020, '5', 794, 812, 993, 1020, '8', 815, 833, 991, 1020, '4', 834, 852, 993, 1017, '4', 857, 873, 994, 1018, '3', 878, 896, 991, 1020, '8', 899, 917, 991, 1020, '0', 920, 938, 991, 1020, '/', 950, 960, 991, 1020, '0', 971, 990, 993, 1020, '7', 995, 1011, 993, 1018, '1', 1016, 1026, 993, 1018, '6', 1034, 1052, 993, 1020, '7', 1055, 1073, 993, 1020, '4', 1076, 1094, 993, 1018, '8', 1098, 1116, 991, 1020, '9', 1119, 1137, 991, 1020, '0', 1140, 1158, 993, 1020, '9', 1160, 1178, 991, 1020], ['N', 34, 51, 1045, 1066, '/', 54, 61, 1045, 1066, 'B', 63, 79, 1044, 1066, 'O', 82, 102, 1044, 1066, 'N', 105, 121, 1045, 1066, 'D', 133, 151, 1045, 1066, 'E', 156, 172, 1044, 1066, 'L', 183, 196, 1045, 1066, 'I', 199, 204, 1045, 1066, 'V', 205, 223, 1045, 1066, 'R', 226, 244, 1045, 1066, 'A', 246, 266, 1045, 1066, 'I', 267, 272, 1045, 1066, 'S', 275, 291, 1044, 1066, 'O', 294, 314, 1045, 1066, 'N', 318, 335, 1045, 1066], ['8', 72, 90, 1093, 1122, '2', 93, 109, 1093, 1122, '5', 114, 132, 1095, 1122, '9', 135, 153, 1093, 1122, '7', 154, 172, 1095, 1122, '1', 178, 189, 1093, 1122, '3', 196, 214, 1093, 1122, '1', 220, 231, 1095, 1122, '0', 238, 257, 1093, 1122, '3', 260, 278, 1093, 1122, '0', 407, 425, 1093, 1122, '6', 429, 447, 1095, 1122, '.', 452, 455, 1117, 1122, '0', 459, 477, 1093, 1122, '2', 480, 498, 1093, 1122, '.', 503, 507, 1117, 1122, '2', 512, 530, 1093, 1122, '0', 533, 551, 1093, 1122, '1', 557, 567, 1095, 1122, '5', 575, 593, 1095, 1122], ['v', 70, 90, 1150, 1171, '/', 88, 97, 1150, 1171, 'r', 100, 118, 1150, 1171, 'é', 121, 136, 1144, 1173, 'f', 141, 156, 1150, 1171, 'ê', 159, 174, 1144, 1173, 'r', 177, 195, 1150, 1173, 'e', 198, 214, 1150, 1171, 'n', 217, 234, 1150, 1171, 'c', 238, 257, 1149, 1171, 'e', 260, 276, 1149, 1173, 'B', 476, 497, 1152, 1179, 'O', 501, 527, 1149, 1179, 'G', 530, 555, 1150, 1180, 'D', 560, 582, 1152, 1179, 'O', 585, 611, 1149, 1179, 'A', 614, 638, 1150, 1179, '1', 642, 653, 1152, 1179, '5', 659, 677, 1153, 1180, 'B', 681, 701, 1152, 1179, 'T', 705, 726, 1152, 1179, '0', 728, 746, 1152, 1179, '6', 749, 767, 1152, 1179]]
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43, 85, 'M', 1451, 1491, 36, 85, 'S', 1500, 1533, 43, 85, 'U', 1539, 1574, 43, 85, 'N', 1581, 1616, 43, 85, 'G', 1623, 1662, 42, 85, 'E', 1686, 1719, 43, 85, 'L', 1725, 1755, 43, 85, 'E', 1763, 1794, 42, 85, 'C', 1800, 1836, 43, 85, 'T', 1841, 1874, 42, 85, 'R', 1880, 1914, 42, 84, 'O', 1919, 1959, 42, 85, 'N', 1965, 1998, 42, 84, 'I', 2007, 2016, 42, 84, 'C', 2022, 2058, 42, 84, 'S', 2066, 2099, 42, 84, 'F', 2121, 2151, 42, 84, 'R', 2159, 2193, 42, 84, 'A', 2198, 2237, 40, 84, 'N', 2243, 2277, 40, 84, 'C', 2285, 2321, 42, 84, 'E', 2328, 2360, 40, 84]]
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, 118, 138, 'c', 1479, 1493, 120, 138, 'i', 1494, 1499, 112, 136, 'é', 1503, 1518, 114, 138, 't', 1520, 1527, 115, 138, 'é', 1530, 1547, 112, 138, 'p', 1559, 1575, 120, 144, 'a', 1577, 1593, 118, 138, 'r', 1596, 1607, 118, 136, 'A', 1616, 1637, 112, 136, 'c', 1640, 1653, 118, 138, 't', 1655, 1664, 115, 136, 'i', 1665, 1670, 112, 136, 'o', 1673, 1688, 118, 138, 'n', 1692, 1707, 118, 136, 's', 1710, 1725, 118, 138, 'S', 1736, 1755, 112, 138, 'i', 1760, 1763, 112, 136, 'm', 1767, 1791, 118, 136, 'p', 1794, 1811, 118, 142, 'l', 1812, 1817, 112, 136, 'i', 1821, 1824, 112, 136, 'f', 1827, 1835, 112, 136, 'i', 1835, 1841, 112, 136, 'é', 1845, 1860, 112, 136, 'e', 1863, 1878, 118, 136, 'a', 1890, 1907, 118, 138, 'u', 1910, 1925, 118, 136, 'C', 1937, 1958, 112, 136, 'a', 1961, 1977, 118, 136, 'p', 1980, 1995, 118, 142, 'i', 1998, 2003, 112, 136, 't', 2006, 2013, 114, 136, 'a', 2015, 2030, 118, 136, 'l', 2034, 2037, 112, 136, 'd', 2051, 2066, 111, 136, 'e', 2069, 2085, 117, 136, '2', 2097, 2112, 112, 136, '7', 2115, 2132, 111, 136, '.', 2136, 2139, 132, 136, '0', 2144, 2159, 111, 136, '0', 2162, 2178, 111, 136, '0', 2180, 2196, 111, 136, '.', 2201, 2205, 132, 135, '0', 2208, 2225, 111, 136, '0', 2228, 2243, 111, 136, '0', 2246, 2261, 111, 136, 't', 2273, 2281, 112, 135, 'i', 2281, 2291, 111, 136], ['1', 1473, 1482, 153, 177, ',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 177, 'u', 1520, 1535, 160, 177, 'e', 1538, 1554, 159, 177, 'F', 1566, 1583, 153, 177, 'r', 1587, 1596, 159, 177, 'u', 1598, 1613, 159, 177, 'c', 1617, 1631, 159, 177, 't', 1634, 1641, 154, 177, 'i', 1643, 1646, 153, 177, 'd', 1650, 1665, 151, 177, 'o', 1668, 1685, 159, 177, 'r', 1688, 1697, 159, 177, 'C', 1709, 1730, 153, 177, 'S', 1733, 1751, 153, 177, '2', 1764, 1779, 153, 177, '0', 1781, 1797, 153, 177, '0', 1800, 1817, 153, 177, '3', 1820, 1835, 151, 177, '9', 1847, 1863, 151, 177, '3', 1866, 1883, 151, 177, '4', 1883, 1901, 153, 175, '8', 1904, 1919, 151, 177, '4', 1919, 1937, 153, 175, 'S', 1950, 1968, 151, 177, 'A', 1971, 1992, 151, 175, 'I', 1995, 2000, 151, 175, 'N', 2004, 2024, 151, 175, 'T', 2027, 2046, 151, 175, 'O', 2058, 2081, 151, 177, 'U', 2085, 2105, 151, 177, 'E', 2109, 2127, 151, 177, 'N', 2130, 2150, 151, 175, 'C', 2163, 2186, 151, 175, 'e', 2187, 2204, 157, 175, 'd', 2207, 2222, 150, 175, 'e', 2225, 2240, 157, 175, 'x', 2243, 2258, 157, 175], ['T', 1638, 1656, 192, 216, 'É', 1659, 1677, 186, 217, 'L', 1682, 1697, 193, 217, 'É', 1701, 1719, 187, 217, 'P', 1722, 1742, 192, 217, 'H', 1746, 1766, 193, 217, 'O', 1770, 1793, 192, 217, 'N', 1796, 1815, 192, 216, 'E', 1820, 1838, 192, 217, '0', 1869, 1886, 190, 216, '1', 1890, 1899, 192, 216, '4', 1914, 1931, 193, 216, '4', 1934, 1950, 193, 216, '0', 1961, 1977, 190, 216, '4', 1980, 1997, 193, 216, '7', 2009, 2024, 192, 216, '0', 2027, 2042, 192, 216, '0', 2055, 2070, 192, 216, '0', 2073, 2090, 192, 216], ['R', 1517, 1538, 232, 258, '.', 1542, 1545, 253, 256, 'C', 1550, 1571, 232, 256, '.', 1575, 1580, 252, 256, 'S', 1584, 1602, 232, 256, '.', 1607, 1611, 252, 256, 'B', 1625, 1643, 232, 256, 'O', 1649, 1670, 231, 258, 'B', 1674, 1692, 232, 256, 'I', 1697, 1701, 232, 256, 'G', 1706, 1728, 232, 256, 'N', 1731, 1751, 232, 256, 'Y', 1754, 1775, 232, 256, 'B', 1788, 1806, 232, 256, '3', 1818, 1835, 231, 256, '3', 1838, 1855, 231, 256, '4', 1855, 1872, 232, 255, '3', 1884, 1899, 232, 256, '6', 1904, 1919, 232, 256, '7', 1922, 1937, 232, 256, '4', 1947, 1964, 232, 256, '9', 1967, 1983, 232, 256, '7', 1986, 2001, 232, 256, '-', 
2013, 2022, 244, 249, 'A', 2034, 2055, 231, 255, 'P', 2057, 2075, 231, 255, 'E', 2079, 2097, 231, 256, '4', 2109, 2126, 232, 255, '6', 2129, 2145, 232, 256, '5', 2148, 2163, 232, 256, '2', 2166, 2183, 232, 255, 'Z', 2193, 2211, 231, 255], ['C', 1628, 1647, 271, 297, 'o', 1652, 1670, 279, 297, 'd', 1671, 1689, 273, 297, 'e', 1692, 1709, 279, 298, 'T', 1721, 1739, 273, 297, 'V', 1742, 1763, 273, 297, 'A', 1763, 1787, 273, 297, 'F', 1818, 1835, 273, 297, 'R', 1839, 1859, 273, 297, '8', 1872, 1889, 273, 297, '9', 1890, 1905, 273, 297, '3', 1919, 1932, 273, 297, '3', 1937, 1952, 273, 297, '4', 1953, 1971, 273, 297, '3', 1983, 1998, 273, 297, '6', 2001, 2018, 273, 297, '7', 2021, 2036, 273, 295, '4', 2048, 2064, 274, 297, '9', 2066, 2082, 273, 297, '7', 2085, 2100, 273, 295]]
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, 316, 339, 't', 1718, 1727, 316, 339, 'p', 1730, 1748, 321, 345, 'i', 1751, 1757, 321, 339, 'f', 1760, 1769, 315, 339, '/', 1769, 1776, 313, 339, 'w', 1779, 1804, 321, 337, 'w', 1804, 1829, 321, 339, 'w', 1830, 1854, 321, 337, '.', 1859, 1863, 333, 337, 's', 1868, 1883, 319, 339, 'a', 1886, 1901, 321, 337, 'm', 1905, 1929, 321, 337, 's', 1932, 1949, 321, 339, 'u', 1953, 1968, 321, 339, 'n', 1973, 1989, 321, 339, 'g', 1992, 2010, 319, 345, '.', 2015, 2019, 333, 337, 'f', 2021, 2033, 313, 337, 'r', 2034, 2045, 319, 337]]
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, 370, 393, 'O', 1382, 1403, 370, 393, 'M', 1404, 1425, 370, 391, 'P', 1430, 1446, 370, 391, 'T', 1448, 1464, 370, 391, 'E', 1467, 1484, 370, 393, 'C', 1494, 1512, 370, 393, 'L', 1515, 1532, 370, 393, 'I', 1533, 1539, 370, 393, 'E', 1542, 1559, 370, 393, 'N', 1560, 1580, 370, 393, 'T', 1580, 1598, 370, 393]]
Here is the second csv file:
char left right top bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 2448
2 'i' 40 100 2402 2410
3 'l' 40 102 2372 2382
4 'm' 40 102 2312 2358
5 'u' 40 102 2292 2310
6 'i' 40 104 2210 2260
7 'l' 40 104 2180 2208
8 'i' 40 104 2140 2166
EDIT1
Here is my output for Solution 2 (with the character_position input described above):
1831 1830 level_2 char left top right bottom FromLine all_chars_in_same_row
0 0 character_position 0 character_position 0 character_position
1 1 'm','i','i','l','m','u','i','l','i','l' 0 'm' 38 104 2456 2492 1 'm','i','i','l','m','u','i','l','i','l'
2 1 'm','i','i','l','m','u','i','l','i','l' 1 'i' 40 102 2442 2448 1 'm','i','i','l','m','u','i','l','i','l'
3 1 'm','i','i','l','m','u','i','l','i','l' 2 'i' 40 100 2402 2410 1 'm','i','i','l','m','u','i','l','i','l'
I think the problem comes from the fact that I have in my data:
[[',' , 'A', ',' , '.', ':' , ';', '1'], [], ['m', 'a',]]
so:
the empty `[]` entries break the ordering. I noticed this when I tried to omit all the empty [], because I then found stray brackets in my csv, e.g. in char: ['a' rather than 'a', in the values: 8794] rather than 8794, or [5345 rather than 5345.
So I processed the csv as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1,3], names=['character_position','LineIndex'])
df = df.replace(['\[','\]'], ['',''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left']=df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top']=df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right']=df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom']=df1['bottom'].replace(['\[','\]'], ['',''], regex=True)
df1.to_csv('char.csv')
Then I noticed the following:
Look at line 1221, column B: it is empty; [] was replaced there, and the columns get switched out of order (B and C) because of the empty char. How do I solve that?
I also have an empty line:
3831 '6' 296 314 3204 3231
3832
3833 '1' 480 492 3202 3229
Line 3832 should be removed, in order to get a clean table without gaps.
**EDIT2:**
In order to solve the problem of empty rows and [] entries in list_characters.csv, such as
[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
and
[[]] [[]]
I did the following:
import numpy as np

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
df1 = df1.dropna()
Then:
import ast
import pandas as pd

df = pd.read_csv('character_position.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print(df.head())
page_number positionlrtb \
0 1841729699_001 [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...
1 1841729699_001 [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]
2 1841729699_001 [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...
3 1841729699_001 [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...
4 1841729699_001 [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...
LineIndex
0 [[mi, il, mu, il, il]]
1 [[.3]]
2 [[amsun]]
3 [[adresse, de, livraison]]
4 [[document]]
import numpy as np
from itertools import chain

cols = ['char','left','top','right','bottom']
df1 = pd.DataFrame({
    "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
    "b": list(chain.from_iterable(df.positionlrtb))})
df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print(df1)
char left top right bottom
0 m 38 104 2456 2492
1 i 40 102 2442 2448
2 i 40 100 2402 2410
3 l 40 102 2372 2382
4 m 40 102 2312 2358
5 u 40 102 2292 2310
6 i 40 104 2210 2260
7 l 40 104 2180 2208
8 i 40 104 2140 2166
However:
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
still returns None values.

Once you create the required data frame, after stacking, don't remove the index; it holds your line number. Since this uses multilevel indexing, get the first index level: that is your line number.
df_data['LineIndex'] = df_data.index.get_level_values(0)
Then you can group by the LineIndex column and get all characters for each LineIndex. This is built as a dictionary. Convert this dictionary into a data frame and finally merge it into the actual data.
Solution 1
import pandas as pd
df_data=pd.read_csv('list_characters.csv' , header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack() # don't remove the index, it has the line from where this record was created
print(df_data)
df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
cols = ['char','left','top','right','bottom','FromLine']
df_data.columns = cols #assign the new column names
#create a new dictionary
#it contains the line number as key and all the characters from that line as value
DictChar= {k: list(v) for k,v in df_data.groupby("FromLine")["char"]}
#convert dictionary to a dataframe
df_chars = pd.DataFrame(list(DictChar.items()))
df_chars.columns = ['FromLine','char']
# Merge dataframes on column 'FromLine'
df_final=df_data.merge(df_chars,on ='FromLine')
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_final.columns=cols
print(df_final)
Solution 2
I personally prefer this solution over the first one. See inline comments for more details
import pandas as pd
df_data=pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
x=len(df_data.columns) #get total number of columns
#get all characters from every 5th column, concatenate and create new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda x: ','.join(x.dropna()), axis=1)
# get index of each row. This is the line number for your record
df_data[x+1]=df_data.index.get_level_values(0)
# now set line number and character columns as Index of data frame
df_data.set_index([x+1,x],inplace=True,drop=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1) #assign character values to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns=cols
df_data.reset_index(inplace=True) #remove mutiindexing
print(df_data[cols])
Output
char left top right bottom from line all_chars_in_same_row
0 '.' 203 213 191 198 0 ['.', '3', 'C']
1 '3' 1758 1775 370 391 0 ['.', '3', 'C']
2 'C' 296 305 1492 1516 0 ['.', '3', 'C']
3 'A' 275 347 147 239 1 ['A', 'M', 'D']
4 'M' 2166 2184 370 391 1 ['A', 'M', 'D']
5 'D' 339 362 1815 1840 1 ['A', 'M', 'D']
6 'A' 73 91 373 394 2 ['A', 'D', 'A']
7 'D' 1395 1415 427 454 2 ['A', 'D', 'A']
8 'A' 1440 1455 2047 2073 2 ['A', 'D', 'A']
9 'D' 454 473 663 685 3 ['D', 'O', '0']
10 'O' 1533 1545 487 541 3 ['D', 'O', '0']
11 '0' 339 360 2137 2163 3 ['D', 'O', '0']
12 'A' 108 129 727 751 4 ['A', 'V', 'I']
13 'V' 1659 1677 490 514 4 ['A', 'V', 'I']
14 'I' 339 360 1860 1885 4 ['A', 'V', 'I']
15 'N' 34 51 949 970 5 ['N', '/', '2']
16 '/' 1890 1904 486 505 5 ['N', '/', '2']
17 '2' 1266 1283 1951 1977 5 ['N', '/', '2']
18 'S' 1368 1401 43 85 6 ['S', 'A', '8']
19 'A' 1344 1361 583 607 6 ['S', 'A', '8']
20 '8' 2207 2217 1492 1515 6 ['S', 'A', '8']
21 'S' 1437 1457 112 138 7 ['S', 'o', 'O']
22 'o' 1548 1580 979 1015 7 ['S', 'o', 'O']
23 'O' 1331 1349 370 391 7 ['S', 'o', 'O']
24 'h' 1686 1703 315 339 8 ['h', 't', 't']
25 't' 169 190 1291 1312 8 ['h', 't', 't']
26 't' 169 190 1291 1312 8 ['h', 't', 't']
27 'N' 1331 1349 370 391 9 ['N', 'C', 'C']
28 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
29 'C' 296 305 1492 1516 9 ['N', 'C', 'C']

Related

How to remove some extra tuples and lists and print some extra stuff between the strings in python?

bounds = reader.readtext(np.array(images[0]), min_size=0,slope_ths=0.2, ycenter_ths=0.7, height_ths=0.6,width_ths=0.8,decoder='beamsearch',beamWidth=10)
print(bounds)
The output of the above code is a list in the following format:
[([[1004, 128], [1209, 128], [1209, 200], [1004, 200]],
'EC~L',
0.18826377391815186),
([[177, 179], [349, 179], [349, 241], [177, 241]], 'OKI', 0.9966741294455473),
([[180, 236], [422, 236], [422, 268], [180, 268]],
'Oki Eleclric Industry Co',
0.8091106257361781),
([[435, 243], [469, 243], [469, 263], [435, 263]], 'Ltd', 0.9978489622393302),
([[180, 265], [668, 265], [668, 293], [180, 293]],
'4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan',
0.6109240973537998),
([[180, 291], [380, 291], [380, 318], [180, 318]],
'Tel +81-3-5440-4884',
0.9406430290171053)]
How can I write Python code which prints the above in a format similar to the one below?
[1004, 128, 1209, 128, 1209, 200, 1004, 200],
'EC~L'
##################
[177, 179, 349, 179, 349, 241, 177, 241],
'OKI'
##################
[180, 236, 422, 236, 422, 268, 180, 268],
'Oki Eleclric Industry Co'
##################
[435, 243, 469, 243, 469, 263, 435, 263],
'Ltd'
##################
[180, 265, 668, 265, 668, 293, 180, 293],
'4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan'
##################
[180, 291, 380, 291, 380, 318, 180, 318],
'Tel +81-3-5440-4884'
Can anyone help me with this?
I think I'll contribute with this one-liner solution.
print("\n##################\n".join(("{},\n'{}'".format([x for item in items[0] for x in item], items[1])) for items in bounds))
which produces exactly the format the asker wants:
[1004, 128, 1209, 128, 1209, 200, 1004, 200],
'EC~L'
##################
[177, 179, 349, 179, 349, 241, 177, 241],
'OKI'
##################
[180, 236, 422, 236, 422, 268, 180, 268],
'Oki Eleclric Industry Co'
##################
[435, 243, 469, 243, 469, 263, 435, 263],
'Ltd'
##################
[180, 265, 668, 265, 668, 293, 180, 293],
'4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan'
##################
[180, 291, 380, 291, 380, 318, 180, 318],
'Tel +81-3-5440-4884'
If I understood correctly:
for l, s, _ in bounds:
    print([lll for ll in l for lll in ll])
    print(s)
    print('##################')
Output:
[1004, 128, 1209, 128, 1209, 200, 1004, 200]
EC~L
##################
[177, 179, 349, 179, 349, 241, 177, 241]
OKI
##################
[180, 236, 422, 236, 422, 268, 180, 268]
Oki Eleclric Industry Co
##################
[435, 243, 469, 243, 469, 263, 435, 263]
Ltd
##################
[180, 265, 668, 265, 668, 293, 180, 293]
4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan
##################
[180, 291, 380, 291, 380, 318, 180, 318]
Tel +81-3-5440-4884
##################
Another solution
from itertools import chain
for numbers, description, _ in bounds:
    numbers = list(chain(*numbers))
    print(f"{numbers},\n"
          f"'{description}'\n"
          "##################")
Output:
[1004, 128, 1209, 128, 1209, 200, 1004, 200],
'EC~L'
##################
[177, 179, 349, 179, 349, 241, 177, 241],
'OKI'
##################
[180, 236, 422, 236, 422, 268, 180, 268],
'Oki Eleclric Industry Co'
##################
[435, 243, 469, 243, 469, 263, 435, 263],
'Ltd'
##################
[180, 265, 668, 265, 668, 293, 180, 293],
'4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan'
##################
[180, 291, 380, 291, 380, 318, 180, 318],
'Tel +81-3-5440-4884'
##################

If two columns satisfy a condition return calculated value from other columns. Pandas / Windows

Refer to the yellow highlighted cells: if K = LDE, then look for FDE in column J (above the LDE row) and, in the Result column, return (D from the LDE row minus A from the FDE row), i.e. 223 - 307 = -84.
Refer to the green highlighted cells: 152 - 385 = -233, and so on.
How to solve this?
Data:
['03-01-2011', 523, 698, 284, 33, 416, 675, 300, 690, 314, '', '', 'FDM', '']
['27-01-2011', 353, 1, 50, 547, 514, 957, 804, 490, 108, '', 'LDE', '', '']
['28-01-2011', 307, 837, 656, 755, 792, 568, 119, 439, 943, 'FDE', '', '', '']
['31-01-2011', 327, 409, 155, 358, 120, 401, 385, 965, 888, '', '', '', 'LDM']
['01-02-2011', 686, 313, 714, 12, 140, 112, 589, 908, 605, '', '', 'FDM', '']
['24-02-2011', 161, 846, 816, 223, 387, 566, 435, 567, 36, '', 'LDE', '', '']
['25-02-2011', 889, 652, 190, 324, 947, 778, 575, 604, 314, 'FDE', '', '', '']
['28-02-2011', 704, 33, 232, 630, 344, 796, 331, 409, 597, '', '', '', 'LDM']
['01-03-2011', 592, 148, 974, 540, 848, 393, 505, 699, 315, '', '', 'FDM', '']
['31-03-2011', 938, 768, 325, 756, 971, 644, 546, 238, 376, '', 'LDE', '', 'LDM']
['01-04-2011', 385, 298, 654, 655, 2, 112, 960, 306, 477, 'FDE', '', 'FDM', '']
['28-04-2011', 704, 516, 785, 152, 355, 348, 106, 611, 426, '', 'LDE', '', '']
['29-04-2011', 753, 719, 776, 826, 756, 370, 660, 536, 903, 'FDE', '', '', 'LDM']
['02-05-2011', 222, 28, 102, 363, 952, 860, 48, 976, 478, '', '', 'FDM', '']
['26-05-2011', 361, 588, 866, 884, 809, 662, 801, 843, 668, '', 'LDE', '', '']
I found a quite tricky solution that works.
import pandas as pd

# define groups between two LDE
df['Group'] = (df['K'] == 'LDE').cumsum().shift(1, fill_value=0)

# custom function to perform the subtraction
def f(x):
    if x.loc[x['J'] == 'FDE', 'A'].size == 0:
        return None
    else:
        return x.loc[x['K'] == 'LDE', 'D'].iloc[0] - x.loc[x['J'] == 'FDE', 'A'].iloc[0]

# get list of numerical results
results = df.groupby('Group').apply(f).tolist()
# input the list into the specified LDE rows
df.loc[df['K'] == 'LDE', 'Results'] = results
Results
Starting data
df = pd.DataFrame([['03-01-2011', 523, 698, 284, 33, 416, 675, 300, 690, 314, '', '', 'FDM', ''], ['27-01-2011', 353, 1, 50, 547, 514, 957, 804, 490, 108, '', 'LDE', '', ''],
['28-01-2011', 307, 837, 656, 755, 792, 568, 119, 439, 943, 'FDE', '', '', ''], ['31-01-2011', 327, 409, 155, 358, 120, 401, 385, 965, 888, '', '', '', 'LDM'], ['01-02-2011', 686, 313, 714, 12, 140, 112, 589, 908, 605, '', '', 'FDM', ''], ['24-02-2011', 161, 846, 816, 223, 387, 566, 435, 567, 36, '', 'LDE', '', ''], ['25-02-2011', 889, 652, 190, 324, 947, 778, 575, 604, 314, 'FDE', '', '', ''], ['28-02-2011', 704, 33, 232, 630, 344, 796, 331, 409, 597, '', '', '', 'LDM'], ['01-03-2011', 592, 148, 974, 540, 848, 393, 505, 699, 315, '', '', 'FDM', ''], ['31-03-2011', 938, 768, 325, 756, 971, 644, 546, 238, 376, '', 'LDE', '', 'LDM'], ['01-04-2011', 385, 298, 654, 655, 2, 112, 960, 306, 477, 'FDE', '', 'FDM', ''], ['28-04-2011', 704, 516, 785, 152, 355, 348, 106, 611, 426, '', 'LDE', '', ''], ['29-04-2011', 753, 719, 776, 826, 756, 370, 660, 536, 903, 'FDE', '', '', 'LDM'], ['02-05-2011', 222, 28, 102, 363, 952, 860, 48, 976, 478, '', '', 'FDM', ''], ['26-05-2011', 361, 588, 866, 884, 809, 662, 801, 843, 668, '', 'LDE', '', '']],
columns=['Date'] + list(map(chr, range(65, 78))))
Ugly, but works:
find rows that contain FDE in column J;
for each of those rows, find the other row you want;
finally do the calculation and return it.
df = pd.DataFrame([['03-01-2011', 523, 698, 284, 33, 416, 675, 300, 690, 314, '', '', 'FDM', ''],
['27-01-2011', 353, 1, 50, 547, 514, 957, 804, 490, 108, '', 'LDE', '', ''] ,
['28-01-2011', 307, 837, 656, 755, 792, 568, 119, 439, 943, 'FDE', '', '', ''],
['31-01-2011', 327, 409, 155, 358, 120, 401, 385, 965, 888, '', '', '', 'LDM'] ,
['01-02-2011', 686, 313, 714, 12, 140, 112, 589, 908, 605, '', '', 'FDM', ''] ,
['24-02-2011', 161, 846, 816, 223, 387, 566, 435, 567, 36, '', 'LDE', '', ''] ,
['25-02-2011', 889, 652, 190, 324, 947, 778, 575, 604, 314, 'FDE', '', '', ''] ,
['28-02-2011', 704, 33, 232, 630, 344, 796, 331, 409, 597, '', '', '', 'LDM'] ,
['01-03-2011', 592, 148, 974, 540, 848, 393, 505, 699, 315, '', '', 'FDM', ''] ,
['31-03-2011', 938, 768, 325, 756, 971, 644, 546, 238, 376, '', 'LDE', '', 'LDM'],
['01-04-2011', 385, 298, 654, 655, 2, 112, 960, 306, 477, 'FDE', '', 'FDM', ''] ,
['28-04-2011', 704, 516, 785, 152, 355, 348, 106, 611, 426, '', 'LDE', '', ''] ,
['29-04-2011', 753, 719, 776, 826, 756, 370, 660, 536, 903, 'FDE', '', '', 'LDM'],
['02-05-2011', 222, 28, 102, 363, 952, 860, 48, 976, 478, '', '', 'FDM', ''] ,
['26-05-2011', 361, 588, 866, 884, 809, 662, 801, 843, 668, '', 'LDE', '', '']], columns=["Date"]+list("ABCDEFGHIJKLM"))
import numpy as np

def findandcalc(lde):
    # find the last row, from the beginning of the DF up to where LDE was found, that contains "FDE"
    fde = df.iloc[0:lde.name].loc[lambda d: d["J"].eq("FDE")].tail(1)
    # if a row was found, do the calc and return it
    return np.nan if len(fde) == 0 else lde["D"] - fde["A"].values[0]
df.loc[df["K"].eq("LDE"), "Result"] = df.loc[df["K"].eq("LDE")].apply(findandcalc, axis=1)
df
Vectorised solution, much cleaner:
filter to rows that have LDE or FDE in the required columns;
test that FDE is in the previous row, then perform the simple calculation;
join this series back to the dataframe for the final result.
rs = df.loc[df["K"].eq("LDE") | df["J"].eq("FDE")].assign(
Result=lambda d: np.where(
d["K"].eq("LDE") & d["J"].shift().eq("FDE"), d["D"] - d["A"].shift(), np.nan
)
)["Result"]
df.join(rs)
Date        A    B    C    D    E    F    G    H    I    J    K    L    M    Result
03-01-2011  523  698  284  33   416  675  300  690  314            FDM       nan
27-01-2011  353  1    50   547  514  957  804  490  108       LDE            nan
28-01-2011  307  837  656  755  792  568  119  439  943  FDE                 nan
31-01-2011  327  409  155  358  120  401  385  965  888                 LDM  nan
01-02-2011  686  313  714  12   140  112  589  908  605            FDM       nan
24-02-2011  161  846  816  223  387  566  435  567  36        LDE            -84
25-02-2011  889  652  190  324  947  778  575  604  314  FDE                 nan
28-02-2011  704  33   232  630  344  796  331  409  597                 LDM  nan
01-03-2011  592  148  974  540  848  393  505  699  315            FDM       nan
31-03-2011  938  768  325  756  971  644  546  238  376       LDE       LDM  -133
01-04-2011  385  298  654  655  2    112  960  306  477  FDE       FDM       nan
28-04-2011  704  516  785  152  355  348  106  611  426       LDE            -233
29-04-2011  753  719  776  826  756  370  660  536  903  FDE            LDM  nan
02-05-2011  222  28   102  363  952  860  48   976  478            FDM       nan
26-05-2011  361  588  866  884  809  662  801  843  668       LDE            131

COCO .json file contains strange segmentation values in the annotations, how to convert these?

I have a COCO format .json file which contains strange values in the annotation section. Most segmentations here are fine, but some contain 'size' and 'counts' values in a non-human-readable format.
When training my model, I run into errors because of these weird segmentation values. I have read somewhere that these are in RLE format, but I am not sure. I should be able to use a bitmask instead of a polygon to train my model, but I would prefer to handle the root cause and convert these segmentations to the normal format. What is their type, can they be converted to the normal segmentation format, and if so, how can I do that?
{'id': 20, 'image_id': 87, 'category_id': 2, 'segmentation': [[301, 303, 305, 288, 321, 261, 335, 236, 346, 214, 350, 209, 351, 205, 349, 202, 344, 203, 334, 221, 322, 244, 307, 272, 297, 290, 295, 302, 297, 310, 301, 309]], 'area': 829.5, 'bbox': [295, 202, 56, 108], 'iscrowd': 0}
{'id': 21, 'image_id': 87, 'category_id': 2, 'segmentation': [[292, 300, 288, 278, 287, 270, 283, 260, 280, 249, 276, 240, 273, 234, 270, 233, 268, 233, 266, 236, 268, 240, 272, 244, 274, 253, 276, 259, 277, 265, 280, 272, 281, 284, 285, 299, 288, 306, 291, 306, 292, 304]], 'area': 517.0, 'bbox': [266, 233, 26, 73], 'iscrowd': 0}
{'id': 22, 'image_id': 87, 'category_id': 2, 'segmentation': [[300, 279, 305, 249, 311, 233, 313, 224, 314, 211, 319, 185, 322, 172, 323, 162, 321, 155, 318, 158, 314, 168, 311, 189, 306, 217, 299, 228, 296, 237, 296, 245, 296, 254, 295, 260, 291, 279, 290, 289, 293, 295, 295, 293, 299, 287]], 'area': 1177.0, 'bbox': [290, 155, 33, 140], 'iscrowd': 0}
{'id': 23, 'image_id': 87, 'category_id': 2, 'segmentation': [[311, 308, 311, 299, 314, 292, 315, 286, 315, 282, 311, 282, 307, 284, 303, 294, 301, 303, 302, 308, 306, 307]], 'area': 235.5, 'bbox': [301, 282, 14, 26], 'iscrowd': 0}
#Weird values
{'id': 24, 'image_id': 27, 'category_id': 2, 'segmentation': {'size': [618, 561], 'counts': 'of[56Tc00O2O000001O00000OXjP5'}, 'area': 71, 'bbox': [284, 326, 10, 8], 'iscrowd': 0}
{'id': 25, 'image_id': 27, 'category_id': 1, 'segmentation': {'size': [618, 561], 'counts': 'fga54Pc0<H4L4M2O2M3M2N2N3N1N2N101N101O00000O10000O1000000000000000000000O100O100O2N1O1O2N2N3L4M3MdRU4'}, 'area': 1809, 'bbox': [294, 294, 46, 47], 'iscrowd': 0}
#Normal values again
{'id': 26, 'image_id': 61, 'category_id': 1, 'segmentation': [[285, 274, 285, 269, 281, 262, 276, 259, 271, 256, 266, 255, 257, 261, 251, 267, 251, 271, 250, 280, 251, 286, 254, 292, 258, 296, 261, 296, 265, 294, 272, 291, 277, 287, 280, 283, 283, 278]], 'area': 1024.0, 'bbox': [250, 255, 35, 41], 'iscrowd': 0}
{'id': 27, 'image_id': 61, 'category_id': 2, 'segmentation': [[167, 231, 175, 227, 180, 226, 188, 226, 198, 228, 215, 235, 228, 239, 235, 243, 259, 259, 255, 261, 252, 264, 226, 249, 216, 244, 203, 238, 194, 235, 184, 234, 171, 235, 167, 233]], 'area': 782.5, 'bbox': [167, 226, 92, 38], 'iscrowd': 0}
{'id': 28, 'image_id': 61, 'category_id': 2, 'segmentation': [[279, 186, 281, 188, 281, 192, 280, 195, 278, 200, 274, 210, 271, 218, 267, 228, 266, 233, 266, 236, 265, 239, 264, 256, 261, 257, 257, 259, 255, 244, 256, 240, 256, 238, 257, 234, 259, 227, 264, 216, 267, 205, 271, 195, 274, 190]], 'area': 593.0, 'bbox': [255, 186, 26, 73], 'iscrowd': 0}
{'id': 29, 'image_id': 61, 'category_id': 2, 'segmentation': [[264, 245, 267, 239, 269, 236, 276, 232, 280, 230, 285, 227, 287, 227, 288, 229, 287, 232, 284, 234, 282, 237, 280, 239, 276, 241, 274, 246, 271, 254, 269, 254, 266, 254, 264, 254]], 'area': 264.0, 'bbox': [264, 227, 24, 27], 'iscrowd': 0}
You can find all you need here:
Interface for manipulating masks stored in RLE format
"RLE is a simple yet efficient format for storing binary masks. RLE first divides a vector (or vectorized image) into a series of piecewise
constant regions and then for each piece simply stores the length of that piece. For example, given M=[0 0 1 1 1 0 1] the RLE counts would be [2 3 1 1], or for M=[1 1 1 1 1 1 0] the counts would be [0 6 1] (note that the odd counts are always the numbers of zeros).
Instead of storing the counts directly, additional compression is achieved with a variable bitrate representation based on a common scheme called LEB128."
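To make the counts convention concrete, here is a tiny pure-Python sketch of the uncompressed counts (the real COCO format additionally compresses these counts with the LEB128-based scheme mentioned above):
def rle_counts(mask):
    # run-length counts; by convention the first count is the number of leading zeros
    counts, prev, run = [], 0, 0
    for v in mask:
        if v == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = v, 1
    counts.append(run)
    return counts

print(rle_counts([0, 0, 1, 1, 1, 0, 1]))  # [2, 3, 1, 1]
print(rle_counts([1, 1, 1, 1, 1, 1, 0]))  # [0, 6, 1]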
So, basically you can have the mask annotated as:
a polygon in standard coco-json format (x, y, x, y, x, y, etc.),
a binary mask (a png image),
an RLE encoded format.
All three describe the same mask, but you sometimes need to convert between them (in case your DL library doesn't support all of them, or doesn't convert them for you); a conversion sketch follows below.
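A minimal conversion sketch, assuming pycocotools and OpenCV are installed; decoding the RLE to a binary mask uses the pycocotools mask interface linked above, while extracting polygons with cv2.findContours is one common choice among several:
import cv2
from pycocotools import mask as mask_utils

# RLE dict copied from the annotation; pycocotools expects 'counts' as bytes
rle = {'size': [618, 561], 'counts': b'of[56Tc00O2O000001O00000OXjP5'}

binary_mask = mask_utils.decode(rle)  # uint8 array of shape (618, 561)

# trace the mask outline(s) and flatten each to the polygon [x, y, x, y, ...] format
contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
segmentation = [c.flatten().tolist() for c in contours if c.size >= 6]
print(segmentation)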

Python 3.7: I can't convert these bytes to a string

I have this code:
byte = b'\x7f\x9fKL\xaa\xe6\xc8\x8d\xdf865\xf1s\t`R\xd6\xe8\x9c\x07\xae\x97\xe4\x0e\xe6\x08_CZY(1\x94\xca1\x165m\xd6m\x90xs\xc7\x90d\x0c\xe3\xe9;\x9ec\xd3Q\xe6\x11<z\xff:\x97\x9cz\x86{\xdd\x82S\xfc_\xbcow,`i<\xdd\x0f\xe0^\xb12\xdc,\xf5\x08\xdeey\xbb\xf4o\xadx\xc8(\xd0\xab)\xc1\x7f\xbe<z\xderLp\xa0\x02\x0c\x87!+q\x90\xae\x17\xd0\\y04\x1f\xae\xd2x\xc2\x92\xd4\xd5\x04\x9c\x9c\xc7\x0e\xcbxb\x81\xab\xe4w\xf4\xa1\x9f5\xb1p\xf1\xdf\x12^\x00lA\x83\xe1KP\xdb\xa93\x83\x13\x19\xb8\xf7RA\xe8\xe7\xdcU\xfc\xff\xbcJ\x9d\xc2\xba \xd5\xd5>\x15X#=\xf9\xdf\xbe\xee.\xc5\x82c\r\xd6\xad\x88=\xfc\x9f\xf4%+\xf5\ry\xb7\xb2\xabN\x1a\xb5$\xb6\x8b\x7f2sT\x9eo//\xb3\xbe\xdc\xc8\xbc\xc40\xae/P\xef\x1a\x0bP\x96R\xa0p\xe5\x8a\xad\x11\xe5u\xaa\xcbR'
print(str(byte,'utf-8'))
I want to convert these bytes to a string and put that string into a JSON file, so I can take the string back out and convert it to bytes again when I want to use it.
But when I try to convert it, I get this error:
Traceback (most recent call last):
  File "wallet.py", line 126, in <module>
    print(str(byte,'utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 1: invalid start byte
You could decode it like this,
>>> byte
b'\x7f\x9fKL\xaa\xe6\xc8\x8d\xdf865\xf1s\tR\xd6\xe8\x9c\x07\xae\x97\xe4\x0e\xe6\x08_CZY(1\x94\xca1\x165m\xd6m\x90xs\xc7\x90d\x0c\xe3\xe9;\x9ec\xd3Q\xe6\x11<z\xff:\x97\x9cz\x86{\xdd\x82S\xfc_\xbcow,i<\xdd\x0f\xe0^\xb12\xdc,\xf5\x08\xdeey\xbb\xf4o\xadx\xc8(\xd0\xab)\xc1\x7f\xbe\x15X#=\xf9\xdf\xbe\xee.\xc5\x82c\r\xd6\xad\x88=\xfc\x9f\xf4%+\xf5\ry\xb7\xb2\xabN\x1a\xb5$\xb6\x8b\x7f2sT\x9eo//\xb3\xbe\xdc\xc8\xbc\xc40\xae/P\xef\x1a\x0bP\x96R\xa0p\xe5\x8a\xad\x11\xe5u\xaa\xcbR'
>>> print(''.join(chr(x) for x in byte))
KLªæÈß865ñs RÖè®ä_CZY(1Ê15mÖmxsÇd
y·²«N▒µ$¶2sTo//³¾ÜȼÄ0®/Pï▒ ãé;cÓQæ<zÿ:z{ÝSü_¼ow,i<Ýà^±2Ü,Þey»ôo­xÈ(Ы)Á¾X#=ùß¾î.Åc
PR på­åuªËR
You can see what is going on here,
>>> y = [chr(x) for x in byte]
>>> y
['\x7f', '\x9f', 'K', 'L', 'ª', 'æ', 'È', '\x8d', 'ß', '8', '6', '5', 'ñ', 's', '\t', 'R', 'Ö', 'è', '\x9c', '\x07', '®', '\x97', 'ä', '\x0e', 'æ', '\x08', '_', 'C', 'Z', 'Y', '(', '1', '\x94', 'Ê', '1', '\x16', '5', 'm', 'Ö', 'm', '\x90', 'x', 's', 'Ç', '\x90', 'd', '\x0c', 'ã', 'é', ';', '\x9e', 'c', 'Ó', 'Q', 'æ', '\x11', '<', 'z', 'ÿ', ':', '\x97', '\x9c', 'z', '\x86', '{', 'Ý', '\x82', 'S', 'ü', '_', '¼', 'o', 'w', ',', 'i', '<', 'Ý', '\x0f', 'à', '^', '±', '2', 'Ü', ',', 'õ', '\x08', 'Þ', 'e', 'y', '»', 'ô', 'o', '\xad', 'x', 'È', '(', 'Ð', '«', ')', 'Á', '\x7f', '¾', '\x15', 'X', '#', '=', 'ù', 'ß', '¾', 'î', '.', 'Å', '\x82', 'c', '\r', 'Ö', '\xad', '\x88', '=', 'ü', '\x9f', 'ô', '%', '+', 'õ', '\r', 'y', '·', '²', '«', 'N', '\x1a', 'µ', '$', '¶', '\x8b', '\x7f', '2', 's', 'T', '\x9e', 'o', '/', '/', '³', '¾', 'Ü', 'È', '¼', 'Ä', '0', '®', '/', 'P', 'ï', '\x1a', '\x0b', 'P', '\x96', 'R', '\xa0', 'p', 'å', '\x8a', '\xad', '\x11', 'å', 'u', 'ª', 'Ë', 'R']
>>> [ord(x) for x in y]
[127, 159, 75, 76, 170, 230, 200, 141, 223, 56, 54, 53, 241, 115, 9, 82, 214, 232, 156, 7, 174, 151, 228, 14, 230, 8, 95, 67, 90, 89, 40, 49, 148, 202, 49, 22, 53, 109, 214, 109, 144, 120, 115, 199, 144, 100, 12, 227, 233, 59, 158, 99, 211, 81, 230, 17, 60, 122, 255, 58, 151, 156, 122, 134, 123, 221, 130, 83, 252, 95, 188, 111, 119, 44, 105, 60, 221, 15, 224, 94, 177, 50, 220, 44, 245, 8, 222, 101, 121, 187, 244, 111, 173, 120, 200, 40, 208, 171, 41, 193, 127, 190, 21, 88, 35, 61, 249, 223, 190, 238, 46, 197, 130, 99, 13, 214, 173, 136, 61, 252, 159, 244, 37, 43, 245, 13, 121, 183, 178, 171, 78, 26, 181, 36, 182, 139, 127, 50, 115, 84, 158, 111, 47, 47, 179, 190, 220, 200, 188, 196, 48, 174, 47, 80, 239, 26, 11, 80, 150, 82, 160, 112, 229, 138, 173, 17, 229, 117, 170, 203, 82]
>>> bytes([ord(x) for x in y])
b'\x7f\x9fKL\xaa\xe6\xc8\x8d\xdf865\xf1s\tR\xd6\xe8\x9c\x07\xae\x97\xe4\x0e\xe6\x08_CZY(1\x94\xca1\x165m\xd6m\x90xs\xc7\x90d\x0c\xe3\xe9;\x9ec\xd3Q\xe6\x11<z\xff:\x97\x9cz\x86{\xdd\x82S\xfc_\xbcow,i<\xdd\x0f\xe0^\xb12\xdc,\xf5\x08\xdeey\xbb\xf4o\xadx\xc8(\xd0\xab)\xc1\x7f\xbe\x15X#=\xf9\xdf\xbe\xee.\xc5\x82c\r\xd6\xad\x88=\xfc\x9f\xf4%+\xf5\ry\xb7\xb2\xabN\x1a\xb5$\xb6\x8b\x7f2sT\x9eo//\xb3\xbe\xdc\xc8\xbc\xc40\xae/P\xef\x1a\x0bP\x96R\xa0p\xe5\x8a\xad\x11\xe5u\xaa\xcbR'
>>>
>>> len(y)
171
>>> len(byte)
171
Try this code:
print(byte.decode('latin-1'))
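To show why this helps with the asker's goal: latin-1 maps every byte value 0-255 to a character, so the decode never fails and is fully reversible. A minimal round-trip sketch through JSON (the shortened byte string is illustrative):
import json

byte = b'\x7f\x9fKL\xaa\xe6\xc8\x8d'  # shortened example

as_text = byte.decode('latin-1')        # bytes -> str, cannot fail
payload = json.dumps({'key': as_text})  # the str is JSON-serialisable

restored = json.loads(payload)['key'].encode('latin-1')  # str -> bytes
assert restored == byte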

How to get a mapping of country codes to international number prefixes in Python? [duplicate]

This question already has answers here:
Get country name from Country code in python?
I'm interested in getting a mapping of country codes to international phone number prefixes, like so:
{'US': '+1', 'GB': '+44', 'DE': '+49', ...}
One library that probably contains this information is python-phonenumbers. However, after a quick perusal of the source code I wasn't able to find where this information is stored. For example, the shortdata/region_DE.py module looks like this:
"""Auto-generated file, do not edit by hand. DE metadata"""
from ..phonemetadata import NumberFormat, PhoneNumberDesc, PhoneMetadata
PHONE_METADATA_DE = PhoneMetadata(id='DE', country_code=None, international_prefix=None,
general_desc=PhoneNumberDesc(national_number_pattern='1\\d{2,5}', possible_length=(3, 6)),
toll_free=PhoneNumberDesc(national_number_pattern='116\\d{3}', example_number='116000', possible_length=(6,)),
emergency=PhoneNumberDesc(national_number_pattern='11[02]', example_number='112', possible_length=(3,)),
short_code=PhoneNumberDesc(national_number_pattern='11(?:[025]|6(?:00[06]|1(?:1[17]|23)))', example_number='115', possible_length=(3, 6)),
short_data=True)
It seems like the country_code and international_prefix fields are None. How can I get such a mapping (possibly with a different library)?
You can get the mapping you want using pycountry and phonenumbers, along with a simple dictionary comprehension:
import phonenumbers as pn
import pycountry
dct = {c.alpha_2: pn.country_code_for_region(c.alpha_2) for c in pycountry.countries}
print(dct)
Output:
{'SK': 421, 'KI': 686, 'LV': 371, 'GH': 233, 'JP': 81, 'SA': 966, 'TD': 235, 'SX': 1, 'CY': 357, 'CH': 41, 'EG': 20, 'PA': 507, 'KP': 850, 'CO': 57, 'GW': 245, 'KG': 996, 'AW': 297, 'FM': 691, 'SB': 677, 'HR': 385, 'PY': 595, 'BG': 359, 'IQ': 964, 'ID': 62, 'GQ': 240, 'CA': 1, 'CG': 242, 'MO': 853, 'SL': 232, 'LA': 856, 'OM': 968, 'MP': 1, 'DK': 45, 'FI': 358, 'DO': 1, 'BM': 1, 'GN': 224, 'NE': 227, 'ER': 291, 'DE': 49, 'UM': 0, 'CM': 237, 'PR': 1, 'RO': 40, 'AZ': 994, 'DZ': 213, 'BW': 267, 'MK': 389, 'HN': 504, 'IS': 354, 'SJ': 47, 'ME': 382, 'NR': 674, 'AD': 376, 'BY': 375, 'RE': 262, 'PG': 675, 'SO': 252, 'NO': 47, 'CC': 61, 'EE': 372, 'BN': 673, 'AU': 61, 'HM': 0, 'ML': 223, 'BD': 880, 'GE': 995, 'US': 1, 'UY': 598, 'SM': 378, 'NG': 234, 'BE': 32, 'KY': 1, 'AR': 54, 'CR': 506, 'VA': 39, 'YE': 967, 'TR': 90, 'CV': 238, 'DM': 1, 'ZM': 260, 'BR': 55, 'MG': 261, 'BL': 590, 'FJ': 679, 'SH': 290, 'KN': 1, 'ZA': 27, 'CF': 236, 'ZW': 263, 'PL': 48, 'SV': 503, 'QA': 974, 'MN': 976, 'SE': 46, 'JE': 44, 'PS': 970, 'MZ': 258, 'TK': 690, 'PM': 508, 'CW': 599, 'HK': 852, 'LB': 961, 'SY': 963, 'LC': 1, 'IE': 353, 'RW': 250, 'NL': 31, 'MA': 212, 'GM': 220, 'IR': 98, 'AT': 43, 'SZ': 268, 'GT': 502, 'MT': 356, 'BQ': 599, 'MX': 52, 'NC': 687, 'CK': 682, 'SI': 386, 'VE': 58, 'IM': 44, 'AM': 374, 'SD': 249, 'LY': 218, 'LI': 423, 'TN': 216, 'UG': 256, 'RU': 7, 'DJ': 253, 'IL': 972, 'TM': 993, 'BF': 226, 'GF': 594, 'TO': 676, 'GI': 350, 'MH': 692, 'UZ': 998, 'PF': 689, 'KZ': 7, 'GA': 241, 'PE': 51, 'TV': 688, 'BT': 975, 'MQ': 596, 'MF': 590, 'AF': 93, 'IN': 91, 'AX': 358, 'BH': 973, 'JM': 1, 'MY': 60, 'BO': 591, 'AI': 1, 'SR': 597, 'ET': 251, 'ES': 34, 'TF': 0, 'GU': 1, 'BJ': 229, 'SS': 211, 'KE': 254, 'BZ': 501, 'IO': 246, 'MU': 230, 'CL': 56, 'MD': 373, 'LU': 352, 'TJ': 992, 'EC': 593, 'VG': 1, 'NZ': 64, 'VU': 678, 'FO': 298, 'LR': 231, 'AL': 355, 'GB': 44, 'AS': 1, 'IT': 39, 'TC': 1, 'TW': 886, 'BI': 257, 'HU': 36, 'TL': 670, 'GG': 44, 'PN': 0, 'SG': 65, 'LS': 266, 'KH': 855, 'FR': 33, 'BV': 0, 'CX': 61, 'AE': 971, 'LT': 370, 'PT': 351, 'KR': 82, 'BB': 1, 'TG': 228, 'AQ': 0, 'EH': 212, 'AG': 1, 'VN': 84, 'CI': 225, 'BS': 1, 'GL': 299, 'MW': 265, 'NU': 683, 'NF': 672, 'LK': 94, 'MS': 1, 'GP': 590, 'NP': 977, 'PW': 680, 'PK': 92, 'WF': 681, 'BA': 387, 'KM': 269, 'JO': 962, 'CU': 53, 'GR': 30, 'YT': 262, 'RS': 381, 'NA': 264, 'ST': 239, 'SC': 248, 'CN': 86, 'CD': 243, 'GS': 0, 'KW': 965, 'MM': 95, 'AO': 244, 'MV': 960, 'UA': 380, 'TT': 1, 'FK': 500, 'WS': 685, 'CZ': 420, 'PH': 63, 'VI': 1, 'TZ': 255, 'MR': 222, 'MC': 377, 'SN': 221, 'HT': 509, 'VC': 1, 'NI': 505, 'GD': 1, 'GY': 592, 'TH': 66}
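If you want the '+' prefix exactly as shown in the question, the same comprehension can format the values as strings:
dct = {c.alpha_2: f"+{pn.country_code_for_region(c.alpha_2)}" for c in pycountry.countries}
print(dct['US'])  # +1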
I have just found a Python library that should be perfect for your problem.
It's called PhoneISO3166.
This is the GitHub link: GitHub phoneiso3166
