Get the row index of each extracted character from a CSV file - python
I have a column (the second column, called character_position) in my CSV file which represents a list of characters and their positions, as follows:
Each line of this column contains a list of character_position values. Overall I have 300 lines in this column, each with a list of character positions:
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
Each character is followed by four values: left, top, right, bottom. For instance, the character '1' has left=1890, top=1904, right=486, bottom=505.
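To make the layout concrete, here is a minimal sketch (illustrative, not from the original code) that parses one such cell with ast.literal_eval and walks it in steps of five:

import ast

cell = "[['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507]]"  # one CSV cell, shortened
rows = ast.literal_eval(cell)            # list of sub-lists, one per text row
for row in rows:
    for i in range(0, len(row), 5):      # each character occupies 5 consecutive slots
        char, left, top, right, bottom = row[i:i + 5]
        print(char, left, top, right, bottom)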
My whole CSV file is read as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position'])
From this file I created a new CSV file with five columns:
column 1: char, column 2: left, column 3: top, column 4: right, column 5: bottom.
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
# pair up every 5 columns: level 0 = field position, level 1 = character index
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print(df1)
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
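As an aside on how this reshape works: the columns % 5 / columns // 5 assignment turns the flat 0..N column index into (field, character index) pairs, and stack() then folds each quintuple into its own row. A tiny self-contained demo on toy data (not from the file):

import pandas as pd

wide = pd.DataFrame([['a', 1, 2, 3, 4, 'b', 5, 6, 7, 8]])   # two characters, flattened
wide.columns = [wide.columns % 5, wide.columns // 5]         # (field, character index)
tall = wide.stack().reset_index(drop=True)
tall.columns = ['char', 'left', 'top', 'right', 'bottom']
print(tall)                                                  # two rows, one per character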
I want to add two other columns called line_number and all_chars_in_same_row:
1) line_number corresponds to the line of the source file from which a row was extracted; for example, 'm' 38 104 2456 2492 might come from line 2.
2) all_chars_in_same_row corresponds to all the (spaced) characters that are in the same row. For instance, given the character_position list shown above, I get '1' '8' '4' '1' '7' and so on.
More formally:
all_chars_in_same_row means: for each character, write out all the characters of the row identified by line_number.
char left top right bottom line_number all_chars_in_same_row
0 'm' 38 104 2456 2492 from line 2 'm' '2' '5' 'g'
1 'i' 40 102 2442 222 from line 4
2 '.' 203 213 191 198 from line 6
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
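Assuming a line_number column is already attached (the answer below shows how to recover it from the stacked index), all_chars_in_same_row can be derived with a groupby/transform; a hedged sketch:

# assumes df1 already carries a 'line_number' column
df1['all_chars_in_same_row'] = (
    df1.groupby('line_number')['char']
       .transform(lambda s: ' '.join(s)))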
EDIT1:
import pandas as pd
df_data=pd.read_csv('/home/ahmed/internship/cnn_ocr/list_characters.csv')
df_data.shape
(50, 3)
df_data.iloc[:, 1]
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11 [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12 [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13 [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14 [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15 [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16 [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17 [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18 [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19 [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20 [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21 [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22 [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23 [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24 [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25 [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26 [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27 [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28 [['S', 2369, 2382, 1833, 1866]]
29 [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
30 [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
31 [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
32 [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
33 [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
34 [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
35 [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
36 [['c', 88, 118, 2872, 2902]]
37 [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
38 [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
39 [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
40 [[]]
41 [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
42 [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
43 [[]]
44 [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
45 [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
46 [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
47 [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
48 [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
49 [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
Name: character_position, dtype: object
Then, to build my chars.csv, I do the following:
df = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any leftover brackets from every column
for c in cols:
    df1[c] = df1[c].replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('chars.csv')
However, I don't see in your response how you added the columns FromLine and all_chars_in_same_row.
When I execute your line of code:
df_data = df_data.character_position.str.strip('[]').str.split(',', expand=True)
I get the following:
df_data[0:10]
0 1 2 3 4 5 6 7 8 9 ... \
0 'm' 38 104 2456 2492 'i' 40 102 2442 2448 ...
1 '.' 203 213 191 198 '3' 235 262 131 198 ...
2 'A' 275 347 147 239 'M' 363 465 145 239 ...
3 'A' 73 91 373 394 'D' 93 112 373 396 ...
4 'D' 454 473 663 685 'O' 474 495 664 687 ...
5 'A' 108 129 727 751 'V' 129 150 727 753 ...
6 'N' 34 51 949 970 '/' 52 61 948 970 ...
7 'S' 1368 1401 43 85 'A' 1406 1446 43 85 ...
8 'S' 1437 1457 112 138 'o' 1458 1476 118 138 ...
9 'h' 1686 1703 315 339 't' 1706 1715 316 339 ...
1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
0 None None None None None None None None None None
1 None None None None None None None None None None
2 None None None None None None None None None None
3 None None None None None None None None None None
4 None None None None None None None None None None
5 None None None None None None None None None None
6 None None None None None None None None None None
Here are the first 10 lines of my CSV file:
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 687, 'C', 498, 516, 664, 687, 'U', 519, 536, 663, 687, 'M', 540, 561, 663, 687, 'E', 564, 581, 663, 685, 'N', 584, 600, 664, 685, 'T', 603, 618, 663, 685]]
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 753, 'O', 153, 175, 727, 753, 'I', 178, 183, 727, 751, 'R', 187, 210, 727, 751, 'S', 220, 240, 727, 753, 'U', 243, 263, 727, 753, 'R', 267, 288, 727, 751, 'F', 302, 318, 727, 751, 'A', 320, 341, 727, 751, 'C', 342, 363, 726, 751, 'T', 366, 384, 726, 750, 'U', 387, 407, 727, 751, 'R', 411, 432, 727, 751, 'E', 435, 453, 726, 751, 'P', 797, 815, 727, 751, 'A', 818, 839, 727, 751, 'G', 840, 863, 727, 751, 'E', 867, 885, 726, 751, '1', 900, 911, 727, 751, '1', 926, 934, 727, 751, '1', 947, 956, 727, 751, '5', 962, 979, 727, 751], ['R', 120, 142, 778, 807, 'T', 144, 165, 778, 805, 'T', 178, 199, 778, 805, 'e', 201, 219, 786, 807, 'c', 222, 240, 786, 807, 'h', 241, 258, 778, 807, 'n', 263, 279, 786, 807, 'i', 284, 287, 778, 805, 'c', 291, 308, 786, 807, 'a', 309, 327, 786, 807, 'R', 350, 374, 778, 807, 'e', 377, 395, 786, 807, 't', 396, 405, 780, 805, 'u', 408, 425, 786, 807, 'r', 429, 440, 786, 807, 'n', 441, 458, 786, 807, '-', 471, 482, 793, 798, 'D', 497, 518, 778, 807, 'O', 522, 548, 777, 807, 'A', 549, 573, 778, 807, '/', 585, 596, 778, 807, 'D', 606, 630, 778, 807, 'A', 632, 656, 778, 807, 'P', 659, 680, 778, 805]]
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970, 'C', 63, 81, 948, 970, 'O', 84, 103, 948, 970, 'M', 106, 127, 949, 970, 'M', 130, 151, 948, 970, 'A', 153, 172, 949, 970, 'N', 175, 192, 949, 970, 'D', 195, 213, 948, 970, 'E', 217, 232, 948, 970], ['1', 73, 84, 993, 1020, '1', 94, 105, 993, 1020, '8', 112, 130, 991, 1020, '4', 135, 153, 993, 1018, '5', 156, 172, 994, 1018, '7', 175, 192, 993, 1018, '6', 195, 213, 993, 1020, '0', 216, 235, 991, 1020, '6', 238, 257, 993, 1020, '5', 260, 278, 993, 1020, '0', 407, 425, 991, 1020, '9', 428, 446, 991, 1020, '.', 450, 455, 1015, 1020, '0', 459, 477, 991, 1020, '1', 485, 494, 994, 1018, '.', 503, 507, 1015, 1020, '2', 512, 530, 991, 1020, '0', 533, 551, 991, 1020, '1', 555, 566, 993, 1020, '5', 575, 593, 993, 1020, 'R', 632, 656, 991, 1020, 'M', 659, 684, 991, 1020, 'A', 689, 713, 991, 1020, 'N', 726, 747, 993, 1020, 'o', 752, 770, 999, 1020, '.', 774, 779, 1015, 1020, '5', 794, 812, 993, 1020, '8', 815, 833, 991, 1020, '4', 834, 852, 993, 1017, '4', 857, 873, 994, 1018, '3', 878, 896, 991, 1020, '8', 899, 917, 991, 1020, '0', 920, 938, 991, 1020, '/', 950, 960, 991, 1020, '0', 971, 990, 993, 1020, '7', 995, 1011, 993, 1018, '1', 1016, 1026, 993, 1018, '6', 1034, 1052, 993, 1020, '7', 1055, 1073, 993, 1020, '4', 1076, 1094, 993, 1018, '8', 1098, 1116, 991, 1020, '9', 1119, 1137, 991, 1020, '0', 1140, 1158, 993, 1020, '9', 1160, 1178, 991, 1020], ['N', 34, 51, 1045, 1066, '/', 54, 61, 1045, 1066, 'B', 63, 79, 1044, 1066, 'O', 82, 102, 1044, 1066, 'N', 105, 121, 1045, 1066, 'D', 133, 151, 1045, 1066, 'E', 156, 172, 1044, 1066, 'L', 183, 196, 1045, 1066, 'I', 199, 204, 1045, 1066, 'V', 205, 223, 1045, 1066, 'R', 226, 244, 1045, 1066, 'A', 246, 266, 1045, 1066, 'I', 267, 272, 1045, 1066, 'S', 275, 291, 1044, 1066, 'O', 294, 314, 1045, 1066, 'N', 318, 335, 1045, 1066], ['8', 72, 90, 1093, 1122, '2', 93, 109, 1093, 1122, '5', 114, 132, 1095, 1122, '9', 135, 153, 1093, 1122, '7', 154, 172, 1095, 1122, '1', 178, 189, 1093, 1122, '3', 196, 214, 1093, 1122, '1', 220, 231, 1095, 1122, '0', 238, 257, 1093, 1122, '3', 260, 278, 1093, 1122, '0', 407, 425, 1093, 1122, '6', 429, 447, 1095, 1122, '.', 452, 455, 1117, 1122, '0', 459, 477, 1093, 1122, '2', 480, 498, 1093, 1122, '.', 503, 507, 1117, 1122, '2', 512, 530, 1093, 1122, '0', 533, 551, 1093, 1122, '1', 557, 567, 1095, 1122, '5', 575, 593, 1095, 1122], ['v', 70, 90, 1150, 1171, '/', 88, 97, 1150, 1171, 'r', 100, 118, 1150, 1171, 'é', 121, 136, 1144, 1173, 'f', 141, 156, 1150, 1171, 'ê', 159, 174, 1144, 1173, 'r', 177, 195, 1150, 1173, 'e', 198, 214, 1150, 1171, 'n', 217, 234, 1150, 1171, 'c', 238, 257, 1149, 1171, 'e', 260, 276, 1149, 1173, 'B', 476, 497, 1152, 1179, 'O', 501, 527, 1149, 1179, 'G', 530, 555, 1150, 1180, 'D', 560, 582, 1152, 1179, 'O', 585, 611, 1149, 1179, 'A', 614, 638, 1150, 1179, '1', 642, 653, 1152, 1179, '5', 659, 677, 1153, 1180, 'B', 681, 701, 1152, 1179, 'T', 705, 726, 1152, 1179, '0', 728, 746, 1152, 1179, '6', 749, 767, 1152, 1179]]
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43, 85, 'M', 1451, 1491, 36, 85, 'S', 1500, 1533, 43, 85, 'U', 1539, 1574, 43, 85, 'N', 1581, 1616, 43, 85, 'G', 1623, 1662, 42, 85, 'E', 1686, 1719, 43, 85, 'L', 1725, 1755, 43, 85, 'E', 1763, 1794, 42, 85, 'C', 1800, 1836, 43, 85, 'T', 1841, 1874, 42, 85, 'R', 1880, 1914, 42, 84, 'O', 1919, 1959, 42, 85, 'N', 1965, 1998, 42, 84, 'I', 2007, 2016, 42, 84, 'C', 2022, 2058, 42, 84, 'S', 2066, 2099, 42, 84, 'F', 2121, 2151, 42, 84, 'R', 2159, 2193, 42, 84, 'A', 2198, 2237, 40, 84, 'N', 2243, 2277, 40, 84, 'C', 2285, 2321, 42, 84, 'E', 2328, 2360, 40, 84]]
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, 118, 138, 'c', 1479, 1493, 120, 138, 'i', 1494, 1499, 112, 136, 'é', 1503, 1518, 114, 138, 't', 1520, 1527, 115, 138, 'é', 1530, 1547, 112, 138, 'p', 1559, 1575, 120, 144, 'a', 1577, 1593, 118, 138, 'r', 1596, 1607, 118, 136, 'A', 1616, 1637, 112, 136, 'c', 1640, 1653, 118, 138, 't', 1655, 1664, 115, 136, 'i', 1665, 1670, 112, 136, 'o', 1673, 1688, 118, 138, 'n', 1692, 1707, 118, 136, 's', 1710, 1725, 118, 138, 'S', 1736, 1755, 112, 138, 'i', 1760, 1763, 112, 136, 'm', 1767, 1791, 118, 136, 'p', 1794, 1811, 118, 142, 'l', 1812, 1817, 112, 136, 'i', 1821, 1824, 112, 136, 'f', 1827, 1835, 112, 136, 'i', 1835, 1841, 112, 136, 'é', 1845, 1860, 112, 136, 'e', 1863, 1878, 118, 136, 'a', 1890, 1907, 118, 138, 'u', 1910, 1925, 118, 136, 'C', 1937, 1958, 112, 136, 'a', 1961, 1977, 118, 136, 'p', 1980, 1995, 118, 142, 'i', 1998, 2003, 112, 136, 't', 2006, 2013, 114, 136, 'a', 2015, 2030, 118, 136, 'l', 2034, 2037, 112, 136, 'd', 2051, 2066, 111, 136, 'e', 2069, 2085, 117, 136, '2', 2097, 2112, 112, 136, '7', 2115, 2132, 111, 136, '.', 2136, 2139, 132, 136, '0', 2144, 2159, 111, 136, '0', 2162, 2178, 111, 136, '0', 2180, 2196, 111, 136, '.', 2201, 2205, 132, 135, '0', 2208, 2225, 111, 136, '0', 2228, 2243, 111, 136, '0', 2246, 2261, 111, 136, 't', 2273, 2281, 112, 135, 'i', 2281, 2291, 111, 136], ['1', 1473, 1482, 153, 177, ',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 177, 'u', 1520, 1535, 160, 177, 'e', 1538, 1554, 159, 177, 'F', 1566, 1583, 153, 177, 'r', 1587, 1596, 159, 177, 'u', 1598, 1613, 159, 177, 'c', 1617, 1631, 159, 177, 't', 1634, 1641, 154, 177, 'i', 1643, 1646, 153, 177, 'd', 1650, 1665, 151, 177, 'o', 1668, 1685, 159, 177, 'r', 1688, 1697, 159, 177, 'C', 1709, 1730, 153, 177, 'S', 1733, 1751, 153, 177, '2', 1764, 1779, 153, 177, '0', 1781, 1797, 153, 177, '0', 1800, 1817, 153, 177, '3', 1820, 1835, 151, 177, '9', 1847, 1863, 151, 177, '3', 1866, 1883, 151, 177, '4', 1883, 1901, 153, 175, '8', 1904, 1919, 151, 177, '4', 1919, 1937, 153, 175, 'S', 1950, 1968, 151, 177, 'A', 1971, 1992, 151, 175, 'I', 1995, 2000, 151, 175, 'N', 2004, 2024, 151, 175, 'T', 2027, 2046, 151, 175, 'O', 2058, 2081, 151, 177, 'U', 2085, 2105, 151, 177, 'E', 2109, 2127, 151, 177, 'N', 2130, 2150, 151, 175, 'C', 2163, 2186, 151, 175, 'e', 2187, 2204, 157, 175, 'd', 2207, 2222, 150, 175, 'e', 2225, 2240, 157, 175, 'x', 2243, 2258, 157, 175], ['T', 1638, 1656, 192, 216, 'É', 1659, 1677, 186, 217, 'L', 1682, 1697, 193, 217, 'É', 1701, 1719, 187, 217, 'P', 1722, 1742, 192, 217, 'H', 1746, 1766, 193, 217, 'O', 1770, 1793, 192, 217, 'N', 1796, 1815, 192, 216, 'E', 1820, 1838, 192, 217, '0', 1869, 1886, 190, 216, '1', 1890, 1899, 192, 216, '4', 1914, 1931, 193, 216, '4', 1934, 1950, 193, 216, '0', 1961, 1977, 190, 216, '4', 1980, 1997, 193, 216, '7', 2009, 2024, 192, 216, '0', 2027, 2042, 192, 216, '0', 2055, 2070, 192, 216, '0', 2073, 2090, 192, 216], ['R', 1517, 1538, 232, 258, '.', 1542, 1545, 253, 256, 'C', 1550, 1571, 232, 256, '.', 1575, 1580, 252, 256, 'S', 1584, 1602, 232, 256, '.', 1607, 1611, 252, 256, 'B', 1625, 1643, 232, 256, 'O', 1649, 1670, 231, 258, 'B', 1674, 1692, 232, 256, 'I', 1697, 1701, 232, 256, 'G', 1706, 1728, 232, 256, 'N', 1731, 1751, 232, 256, 'Y', 1754, 1775, 232, 256, 'B', 1788, 1806, 232, 256, '3', 1818, 1835, 231, 256, '3', 1838, 1855, 231, 256, '4', 1855, 1872, 232, 255, '3', 1884, 1899, 232, 256, '6', 1904, 1919, 232, 256, '7', 1922, 1937, 232, 256, '4', 1947, 1964, 232, 256, '9', 1967, 1983, 232, 256, '7', 1986, 2001, 232, 256, '-', 
2013, 2022, 244, 249, 'A', 2034, 2055, 231, 255, 'P', 2057, 2075, 231, 255, 'E', 2079, 2097, 231, 256, '4', 2109, 2126, 232, 255, '6', 2129, 2145, 232, 256, '5', 2148, 2163, 232, 256, '2', 2166, 2183, 232, 255, 'Z', 2193, 2211, 231, 255], ['C', 1628, 1647, 271, 297, 'o', 1652, 1670, 279, 297, 'd', 1671, 1689, 273, 297, 'e', 1692, 1709, 279, 298, 'T', 1721, 1739, 273, 297, 'V', 1742, 1763, 273, 297, 'A', 1763, 1787, 273, 297, 'F', 1818, 1835, 273, 297, 'R', 1839, 1859, 273, 297, '8', 1872, 1889, 273, 297, '9', 1890, 1905, 273, 297, '3', 1919, 1932, 273, 297, '3', 1937, 1952, 273, 297, '4', 1953, 1971, 273, 297, '3', 1983, 1998, 273, 297, '6', 2001, 2018, 273, 297, '7', 2021, 2036, 273, 295, '4', 2048, 2064, 274, 297, '9', 2066, 2082, 273, 297, '7', 2085, 2100, 273, 295]]
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, 316, 339, 't', 1718, 1727, 316, 339, 'p', 1730, 1748, 321, 345, 'i', 1751, 1757, 321, 339, 'f', 1760, 1769, 315, 339, '/', 1769, 1776, 313, 339, 'w', 1779, 1804, 321, 337, 'w', 1804, 1829, 321, 339, 'w', 1830, 1854, 321, 337, '.', 1859, 1863, 333, 337, 's', 1868, 1883, 319, 339, 'a', 1886, 1901, 321, 337, 'm', 1905, 1929, 321, 337, 's', 1932, 1949, 321, 339, 'u', 1953, 1968, 321, 339, 'n', 1973, 1989, 321, 339, 'g', 1992, 2010, 319, 345, '.', 2015, 2019, 333, 337, 'f', 2021, 2033, 313, 337, 'r', 2034, 2045, 319, 337]]
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, 370, 393, 'O', 1382, 1403, 370, 393, 'M', 1404, 1425, 370, 391, 'P', 1430, 1446, 370, 391, 'T', 1448, 1464, 370, 391, 'E', 1467, 1484, 370, 393, 'C', 1494, 1512, 370, 393, 'L', 1515, 1532, 370, 393, 'I', 1533, 1539, 370, 393, 'E', 1542, 1559, 370, 393, 'N', 1560, 1580, 370, 393, 'T', 1580, 1598, 370, 393]]
Here is the second CSV file:
char left right top bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 2448
2 'i' 40 100 2402 2410
3 'l' 40 102 2372 2382
4 'm' 40 102 2312 2358
5 'u' 40 102 2292 2310
6 'i' 40 104 2210 2260
7 'l' 40 104 2180 2208
8 'i' 40 104 2140 2166
Here is my output for Solution 2 (with the `character_position` input described above):
1831 1830 level_2 char left top right bottom FromLine all_chars_in_same_row
0 0 character_position 0 character_position 0 character_position
1 1 'm','i','i','l','m','u','i','l','i','l' 0 'm' 38 104 2456 2492 1 'm','i','i','l','m','u','i','l','i','l'
2 1 'm','i','i','l','m','u','i','l','i','l' 1 'i' 40 102 2442 2448 1 'm','i','i','l','m','u','i','l','i','l'
3 1 'm','i','i','l','m','u','i','l','i','l' 2 'i' 40 100 2402 2410 1 'm','i','i','l','m','u','i','l','i','l'
I think the problem comes from the fact that I have, in my data, rows like:
[[',' , 'A', ',' , '.', ':' , ';', '1'], [], ['m', 'a',]]
So:
an empty `[]` breaks the column order. I noticed this when I tried to omit all the empty `[]`, because I found my CSV looking like this:
in char: ['a rather than 'a', in the values 8794] rather than 8794, or
[5345 rather than 5345
So I processed the CSV as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1,3], names=['character_position','LineIndex'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any leftover brackets from every column
for c in cols:
    df1[c] = df1[c].replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('char.csv')
Then I noticed the following:
look at line 1221, column B: it is empty. The replace turns [] into an empty string, and then the columns get switched (B and C) because of the empty char. How can I solve that?
I also have empty lines:
3831 '6' 296 314 3204 3231
3832
3833 '1' 480 492 3202 3229
Line 3832 should be removed in order to get a contiguous table.
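One way to patch the already-written char.csv after the fact is to drop the all-empty rows and strip the stray brackets; a minimal sketch, assuming the file and column layout produced by the code above:

import numpy as np
import pandas as pd

df1 = pd.read_csv('char.csv', index_col=0)
# turn blank cells (like line 3832) into NaN, then drop rows that are entirely empty
df1 = df1.replace(r'^\s*$', np.nan, regex=True).dropna(how='all')
# strip stray brackets and quotes left over from the flattened lists
df1['char'] = df1['char'].astype(str).str.strip("[]' ")
df1.to_csv('char.csv')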
EDIT2:
In order to solve the problem of empty rows and empty [] lists in list_characters.csv, such as
[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
and
[[]] [[]]
I did the following:
import numpy as np

# attempt 1: assumes the cells already hold parsed lists; drop empty sub-lists, then empty rows
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
# attempt 2: work on the raw strings instead (note: this starts again from df)
df1 = df.replace([r'\[\],', r'\[\[\]\]', r'^$'], ['', '', np.nan], regex=True)
df1 = df1.dropna()
Then:
import ast
import pandas as pd

df = pd.read_csv('character_position.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
# drop the empty sub-lists inside each cell
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print(df.head())
page_number positionlrtb \
0 1841729699_001 [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...
1 1841729699_001 [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]
2 1841729699_001 [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...
3 1841729699_001 [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...
4 1841729699_001 [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...
LineIndex
0 [[mi, il, mu, il, il]]
1 [[.3]]
2 [[amsun]]
3 [[adresse, de, livraison]]
4 [[document]]
import numpy as np
from itertools import chain

cols = ['char','left','top','right','bottom']
df1 = pd.DataFrame({
    # repeat the page number once per sub-list so it can be re-attached later
    "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
    "b": list(chain.from_iterable(df.positionlrtb))})
df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print(df1)
char left top right bottom
0 m 38 104 2456 2492
1 i 40 102 2442 2448
2 i 40 100 2402 2410
3 l 40 102 2372 2382
4 m 40 102 2312 2358
5 u 40 102 2292 2310
6 i 40 104 2210 2260
7 l 40 104 2180 2208
8 i 40 104 2140 2166
However:
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
returns None values (split with expand=True pads every row with None up to the width of the longest row).
Once you create the required data frame, after stacking, don't remove the index: it holds your line number. Since this is a MultiIndex, get the first level, which is your line number.
Then you can group by the FromLine column and collect all the characters that share the same line. This gives a dictionary; convert the dictionary into a data frame and finally merge it into the actual data.
Solution 1
import pandas as pd

df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()  # don't reset the index, it has the line from which each record was created
print(df_data)
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign the line number to a column
cols = ['char','left','top','right','bottom','FromLine']
df_data.columns = cols  # assign the new column names
# create a new dictionary:
# it contains the line number as key and all the characters from that line as value
DictChar = {k: list(v) for k, v in df_data.groupby("FromLine")["char"]}
# convert the dictionary to a dataframe
df_chars = pd.DataFrame(list(DictChar.items()))
df_chars.columns = ['FromLine', 'all_chars_in_same_row']
# merge the dataframes on column 'FromLine'
df_final = df_data.merge(df_chars, on='FromLine')
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_final.columns = cols
print(df_final)
Solution 2
I personally prefer this solution over the first one; see the inline comments for more details.
import pandas as pd

df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
x = len(df_data.columns)  # get the total number of columns
# take the character from every 5th column, concatenate them, and store them in a new column
df_data[x] = df_data[df_data.columns[::5]].apply(lambda row: ','.join(row.dropna()), axis=1)
# the index of each row is the line number of the record
df_data[x+1] = df_data.index.get_level_values(0)
# now set the line number and the concatenated characters as the index of the data frame
df_data.set_index([x+1, x], inplace=True, drop=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign the line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1)  # assign the character values to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns = cols
df_data.reset_index(inplace=True)  # remove the MultiIndex
print(df_data[cols])
Output
char left top right bottom from line all_chars_in_same_row
0 '.' 203 213 191 198 0 ['.', '3', 'C']
1 '3' 1758 1775 370 391 0 ['.', '3', 'C']
2 'C' 296 305 1492 1516 0 ['.', '3', 'C']
3 'A' 275 347 147 239 1 ['A', 'M', 'D']
4 'M' 2166 2184 370 391 1 ['A', 'M', 'D']
5 'D' 339 362 1815 1840 1 ['A', 'M', 'D']
6 'A' 73 91 373 394 2 ['A', 'D', 'A']
7 'D' 1395 1415 427 454 2 ['A', 'D', 'A']
8 'A' 1440 1455 2047 2073 2 ['A', 'D', 'A']
9 'D' 454 473 663 685 3 ['D', 'O', '0']
10 'O' 1533 1545 487 541 3 ['D', 'O', '0']
11 '0' 339 360 2137 2163 3 ['D', 'O', '0']
12 'A' 108 129 727 751 4 ['A', 'V', 'I']
13 'V' 1659 1677 490 514 4 ['A', 'V', 'I']
14 'I' 339 360 1860 1885 4 ['A', 'V', 'I']
15 'N' 34 51 949 970 5 ['N', '/', '2']
16 '/' 1890 1904 486 505 5 ['N', '/', '2']
17 '2' 1266 1283 1951 1977 5 ['N', '/', '2']
18 'S' 1368 1401 43 85 6 ['S', 'A', '8']
19 'A' 1344 1361 583 607 6 ['S', 'A', '8']
20 '8' 2207 2217 1492 1515 6 ['S', 'A', '8']
21 'S' 1437 1457 112 138 7 ['S', 'o', 'O']
22 'o' 1548 1580 979 1015 7 ['S', 'o', 'O']
23 'O' 1331 1349 370 391 7 ['S', 'o', 'O']
24 'h' 1686 1703 315 339 8 ['h', 't', 't']
25 't' 169 190 1291 1312 8 ['h', 't', 't']
26 't' 169 190 1291 1312 8 ['h', 't', 't']
27 'N' 1331 1349 370 391 9 ['N', 'C', 'C']
28 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
29 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
Related
How to remove some extra tuples and lists and print some extra stuff between the strings in python?
bounds = reader.readtext(np.array(images[0]), min_size=0, slope_ths=0.2, ycenter_ths=0.7,
                         height_ths=0.6, width_ths=0.8, decoder='beamsearch', beamWidth=10)
print(bounds)
The output of the above code is a structure like:
[([[1004, 128], [1209, 128], [1209, 200], [1004, 200]], 'EC~L', 0.18826377391815186), ([[177, 179], [349, 179], [349, 241], [177, 241]], 'OKI', 0.9966741294455473), ([[180, 236], [422, 236], [422, 268], [180, 268]], 'Oki Eleclric Industry Co', 0.8091106257361781), ([[435, 243], [469, 243], [469, 263], [435, 263]], 'Ltd', 0.9978489622393302), ([[180, 265], [668, 265], [668, 293], [180, 293]], '4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan', 0.6109240973537998), ([[180, 291], [380, 291], [380, 318], [180, 318]], 'Tel +81-3-5440-4884', 0.9406430290171053)]
How do I write Python code which prints the above in a format similar to the one below?
[1004, 128, 1209, 128, 1209, 200, 1004, 200], 'EC~L'
##################
[177, 179, 349, 179, 349, 241, 177, 241], 'OKI'
##################
[180, 236, 422, 236, 422, 268, 180, 268], 'Oki Eleclric Industry Co'
##################
[435, 243, 469, 243, 469, 263, 435, 263], 'Ltd'
##################
[180, 265, 668, 265, 668, 293, 180, 293], '4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan'
##################
[180, 291, 380, 291, 380, 318, 180, 318], 'Tel +81-3-5440-4884'
Can anyone help me with this?
I think I'll contribute with this one-liner solution:
print("\n##################\n".join(("{},\n'{}'".format([x for item in items[0] for x in item], items[1])) for items in bounds))
It produces the exact format the asker wants:
[1004, 128, 1209, 128, 1209, 200, 1004, 200], 'EC~L'
##################
[177, 179, 349, 179, 349, 241, 177, 241], 'OKI'
##################
[180, 236, 422, 236, 422, 268, 180, 268], 'Oki Eleclric Industry Co'
##################
[435, 243, 469, 243, 469, 263, 435, 263], 'Ltd'
##################
[180, 265, 668, 265, 668, 293, 180, 293], '4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan'
##################
[180, 291, 380, 291, 380, 318, 180, 318], 'Tel +81-3-5440-4884'
If I understood correctly:
for l, s, _ in bounds:
    print([lll for ll in l for lll in ll])
    print(s)
    print('##################')
Output:
[1004, 128, 1209, 128, 1209, 200, 1004, 200]
EC~L
##################
[177, 179, 349, 179, 349, 241, 177, 241]
OKI
##################
[180, 236, 422, 236, 422, 268, 180, 268]
Oki Eleclric Industry Co
##################
[435, 243, 469, 243, 469, 263, 435, 263]
Ltd
##################
[180, 265, 668, 265, 668, 293, 180, 293]
4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan
##################
[180, 291, 380, 291, 380, 318, 180, 318]
Tel +81-3-5440-4884
##################
Another solution:
from itertools import chain

for numbers, description, _ in bounds:
    numbers = list(chain(*numbers))
    print(f"{numbers},\n"
          f"'{description}'\n"
          "##################")
Output:
[1004, 128, 1209, 128, 1209, 200, 1004, 200],
'EC~L'
##################
[177, 179, 349, 179, 349, 241, 177, 241],
'OKI'
##################
[180, 236, 422, 236, 422, 268, 180, 268],
'Oki Eleclric Industry Co'
##################
[435, 243, 469, 243, 469, 263, 435, 263],
'Ltd'
##################
[180, 265, 668, 265, 668, 293, 180, 293],
'4-11-22 , Shibaura, Minalo-ku, Tokyo 108-855| Japan'
##################
[180, 291, 380, 291, 380, 318, 180, 318],
'Tel +81-3-5440-4884'
##################
If two columns satisfy a condition return calculated value from other columns. Pandas / Windows
Refer to the yellow highlighted cells: if K = LDE, then look for FDE in column J (above LDE's row); in the Result column, return (D from LDE minus A from FDE), i.e. 223-307 = -84. Refer to the green highlighted cells: 152-385 = -233, and so on. How to solve this?
Data:
['03-01-2011', 523, 698, 284, 33, 416, 675, 300, 690, 314, '', '', 'FDM', '']
['27-01-2011', 353, 1, 50, 547, 514, 957, 804, 490, 108, '', 'LDE', '', '']
['28-01-2011', 307, 837, 656, 755, 792, 568, 119, 439, 943, 'FDE', '', '', '']
['31-01-2011', 327, 409, 155, 358, 120, 401, 385, 965, 888, '', '', '', 'LDM']
['01-02-2011', 686, 313, 714, 12, 140, 112, 589, 908, 605, '', '', 'FDM', '']
['24-02-2011', 161, 846, 816, 223, 387, 566, 435, 567, 36, '', 'LDE', '', '']
['25-02-2011', 889, 652, 190, 324, 947, 778, 575, 604, 314, 'FDE', '', '', '']
['28-02-2011', 704, 33, 232, 630, 344, 796, 331, 409, 597, '', '', '', 'LDM']
['01-03-2011', 592, 148, 974, 540, 848, 393, 505, 699, 315, '', '', 'FDM', '']
['31-03-2011', 938, 768, 325, 756, 971, 644, 546, 238, 376, '', 'LDE', '', 'LDM']
['01-04-2011', 385, 298, 654, 655, 2, 112, 960, 306, 477, 'FDE', '', 'FDM', '']
['28-04-2011', 704, 516, 785, 152, 355, 348, 106, 611, 426, '', 'LDE', '', '']
['29-04-2011', 753, 719, 776, 826, 756, 370, 660, 536, 903, 'FDE', '', '', 'LDM']
['02-05-2011', 222, 28, 102, 363, 952, 860, 48, 976, 478, '', '', 'FDM', '']
['26-05-2011', 361, 588, 866, 884, 809, 662, 801, 843, 668, '', 'LDE', '', '']
I found a quite tricky solution that works.
import pandas as pd

# define groups between two LDE
df['Group'] = (df['K'] == 'LDE').cumsum().shift(1, fill_value=0)

# custom function to perform your subtraction
def f(x):
    if x.loc[x['J'] == 'FDE', 'A'].size == 0:
        return None
    else:
        return x.loc[x['K'] == 'LDE', 'D'].iloc[0] - x.loc[x['J'] == 'FDE', 'A'].iloc[0]

# get the list of numerical results
results = df.groupby('Group').apply(f).tolist()

# put the list into the specified LDE rows
df.loc[df['K'] == 'LDE', 'Results'] = results

Starting data:
df = pd.DataFrame([['03-01-2011', 523, 698, 284, 33, 416, 675, 300, 690, 314, '', '', 'FDM', ''],
                   ['27-01-2011', 353, 1, 50, 547, 514, 957, 804, 490, 108, '', 'LDE', '', ''],
                   ['28-01-2011', 307, 837, 656, 755, 792, 568, 119, 439, 943, 'FDE', '', '', ''],
                   ['31-01-2011', 327, 409, 155, 358, 120, 401, 385, 965, 888, '', '', '', 'LDM'],
                   ['01-02-2011', 686, 313, 714, 12, 140, 112, 589, 908, 605, '', '', 'FDM', ''],
                   ['24-02-2011', 161, 846, 816, 223, 387, 566, 435, 567, 36, '', 'LDE', '', ''],
                   ['25-02-2011', 889, 652, 190, 324, 947, 778, 575, 604, 314, 'FDE', '', '', ''],
                   ['28-02-2011', 704, 33, 232, 630, 344, 796, 331, 409, 597, '', '', '', 'LDM'],
                   ['01-03-2011', 592, 148, 974, 540, 848, 393, 505, 699, 315, '', '', 'FDM', ''],
                   ['31-03-2011', 938, 768, 325, 756, 971, 644, 546, 238, 376, '', 'LDE', '', 'LDM'],
                   ['01-04-2011', 385, 298, 654, 655, 2, 112, 960, 306, 477, 'FDE', '', 'FDM', ''],
                   ['28-04-2011', 704, 516, 785, 152, 355, 348, 106, 611, 426, '', 'LDE', '', ''],
                   ['29-04-2011', 753, 719, 776, 826, 756, 370, 660, 536, 903, 'FDE', '', '', 'LDM'],
                   ['02-05-2011', 222, 28, 102, 363, 952, 860, 48, 976, 478, '', '', 'FDM', ''],
                   ['26-05-2011', 361, 588, 866, 884, 809, 662, 801, 843, 668, '', 'LDE', '', '']],
                  columns=['Date'] + list(map(chr, range(65, 78))))
Ugly but works:
- find rows that contain FDE in column J
- for each of those rows, find the other row you want
- finally do the calc and return it

import numpy as np
import pandas as pd

df = pd.DataFrame([['03-01-2011', 523, 698, 284, 33, 416, 675, 300, 690, 314, '', '', 'FDM', ''],
                   ['27-01-2011', 353, 1, 50, 547, 514, 957, 804, 490, 108, '', 'LDE', '', ''],
                   ['28-01-2011', 307, 837, 656, 755, 792, 568, 119, 439, 943, 'FDE', '', '', ''],
                   ['31-01-2011', 327, 409, 155, 358, 120, 401, 385, 965, 888, '', '', '', 'LDM'],
                   ['01-02-2011', 686, 313, 714, 12, 140, 112, 589, 908, 605, '', '', 'FDM', ''],
                   ['24-02-2011', 161, 846, 816, 223, 387, 566, 435, 567, 36, '', 'LDE', '', ''],
                   ['25-02-2011', 889, 652, 190, 324, 947, 778, 575, 604, 314, 'FDE', '', '', ''],
                   ['28-02-2011', 704, 33, 232, 630, 344, 796, 331, 409, 597, '', '', '', 'LDM'],
                   ['01-03-2011', 592, 148, 974, 540, 848, 393, 505, 699, 315, '', '', 'FDM', ''],
                   ['31-03-2011', 938, 768, 325, 756, 971, 644, 546, 238, 376, '', 'LDE', '', 'LDM'],
                   ['01-04-2011', 385, 298, 654, 655, 2, 112, 960, 306, 477, 'FDE', '', 'FDM', ''],
                   ['28-04-2011', 704, 516, 785, 152, 355, 348, 106, 611, 426, '', 'LDE', '', ''],
                   ['29-04-2011', 753, 719, 776, 826, 756, 370, 660, 536, 903, 'FDE', '', '', 'LDM'],
                   ['02-05-2011', 222, 28, 102, 363, 952, 860, 48, 976, 478, '', '', 'FDM', ''],
                   ['26-05-2011', 361, 588, 866, 884, 809, 662, 801, 843, 668, '', 'LDE', '', '']],
                  columns=["Date"] + list("ABCDEFGHIJKLM"))

def findandcalc(lde):
    # find the last row, from the beginning of the DF up to where LDE was found, that contains "FDE"
    fde = df.iloc[0:lde.name].loc[lambda d: d["J"].eq("FDE")].tail(1)
    # if a row was found, do the calc and return it
    return np.nan if len(fde) == 0 else lde["D"] - fde["A"].values[0]

df.loc[df["K"].eq("LDE"), "Result"] = df.loc[df["K"].eq("LDE")].apply(findandcalc, axis=1)
df

A vectorised solution is much cleaner:
- filter to rows that have LDE or FDE in the required columns
- test that FDE is in the previous row, then perform the simple calc
- join this series back to the dataframe for the final result

rs = df.loc[df["K"].eq("LDE") | df["J"].eq("FDE")].assign(
    Result=lambda d: np.where(
        d["K"].eq("LDE") & d["J"].shift().eq("FDE"),
        d["D"] - d["A"].shift(),
        np.nan
    )
)["Result"]
df.join(rs)

Date A B C D E F G H I J K L M Result
03-01-2011 523 698 284 33 416 675 300 690 314 FDM nan
27-01-2011 353 1 50 547 514 957 804 490 108 LDE nan
28-01-2011 307 837 656 755 792 568 119 439 943 FDE nan
31-01-2011 327 409 155 358 120 401 385 965 888 LDM nan
01-02-2011 686 313 714 12 140 112 589 908 605 FDM nan
24-02-2011 161 846 816 223 387 566 435 567 36 LDE -84
25-02-2011 889 652 190 324 947 778 575 604 314 FDE nan
28-02-2011 704 33 232 630 344 796 331 409 597 LDM nan
01-03-2011 592 148 974 540 848 393 505 699 315 FDM nan
31-03-2011 938 768 325 756 971 644 546 238 376 LDE LDM -133
01-04-2011 385 298 654 655 2 112 960 306 477 FDE FDM nan
28-04-2011 704 516 785 152 355 348 106 611 426 LDE -233
29-04-2011 753 719 776 826 756 370 660 536 903 FDE LDM nan
02-05-2011 222 28 102 363 952 860 48 976 478 FDM nan
26-05-2011 361 588 866 884 809 662 801 843 668 LDE 131
COCO .json file contains strange segmentation values in the annotations, how to convert these?
I have a COCO format .json file which contains strange values in the annotation section. Most segmentations here are fine, but some contain size and counts in a non-human-readable format. When training my model, I run into errors because of the weird segmentation values. I have read somewhere that these are in RLE format, but I am not sure. I should be able to use a bitmask instead of a polygon to train my model, but I prefer to handle the root cause and change these segmentations to the normal format. What is their type, can they be converted to the normal segmentation format, and if so, how can I do that?
{'id': 20, 'image_id': 87, 'category_id': 2, 'segmentation': [[301, 303, 305, 288, 321, 261, 335, 236, 346, 214, 350, 209, 351, 205, 349, 202, 344, 203, 334, 221, 322, 244, 307, 272, 297, 290, 295, 302, 297, 310, 301, 309]], 'area': 829.5, 'bbox': [295, 202, 56, 108], 'iscrowd': 0}
{'id': 21, 'image_id': 87, 'category_id': 2, 'segmentation': [[292, 300, 288, 278, 287, 270, 283, 260, 280, 249, 276, 240, 273, 234, 270, 233, 268, 233, 266, 236, 268, 240, 272, 244, 274, 253, 276, 259, 277, 265, 280, 272, 281, 284, 285, 299, 288, 306, 291, 306, 292, 304]], 'area': 517.0, 'bbox': [266, 233, 26, 73], 'iscrowd': 0}
{'id': 22, 'image_id': 87, 'category_id': 2, 'segmentation': [[300, 279, 305, 249, 311, 233, 313, 224, 314, 211, 319, 185, 322, 172, 323, 162, 321, 155, 318, 158, 314, 168, 311, 189, 306, 217, 299, 228, 296, 237, 296, 245, 296, 254, 295, 260, 291, 279, 290, 289, 293, 295, 295, 293, 299, 287]], 'area': 1177.0, 'bbox': [290, 155, 33, 140], 'iscrowd': 0}
{'id': 23, 'image_id': 87, 'category_id': 2, 'segmentation': [[311, 308, 311, 299, 314, 292, 315, 286, 315, 282, 311, 282, 307, 284, 303, 294, 301, 303, 302, 308, 306, 307]], 'area': 235.5, 'bbox': [301, 282, 14, 26], 'iscrowd': 0}
# Weird values
{'id': 24, 'image_id': 27, 'category_id': 2, 'segmentation': {'size': [618, 561], 'counts': 'of[56Tc00O2O000001O00000OXjP5'}, 'area': 71, 'bbox': [284, 326, 10, 8], 'iscrowd': 0}
{'id': 25, 'image_id': 27, 'category_id': 1, 'segmentation': {'size': [618, 561], 'counts': 'fga54Pc0<H4L4M2O2M3M2N2N3N1N2N101N101O00000O10000O1000000000000000000000O100O100O2N1O1O2N2N3L4M3MdRU4'}, 'area': 1809, 'bbox': [294, 294, 46, 47], 'iscrowd': 0}
# Normal values again
{'id': 26, 'image_id': 61, 'category_id': 1, 'segmentation': [[285, 274, 285, 269, 281, 262, 276, 259, 271, 256, 266, 255, 257, 261, 251, 267, 251, 271, 250, 280, 251, 286, 254, 292, 258, 296, 261, 296, 265, 294, 272, 291, 277, 287, 280, 283, 283, 278]], 'area': 1024.0, 'bbox': [250, 255, 35, 41], 'iscrowd': 0}
{'id': 27, 'image_id': 61, 'category_id': 2, 'segmentation': [[167, 231, 175, 227, 180, 226, 188, 226, 198, 228, 215, 235, 228, 239, 235, 243, 259, 259, 255, 261, 252, 264, 226, 249, 216, 244, 203, 238, 194, 235, 184, 234, 171, 235, 167, 233]], 'area': 782.5, 'bbox': [167, 226, 92, 38], 'iscrowd': 0}
{'id': 28, 'image_id': 61, 'category_id': 2, 'segmentation': [[279, 186, 281, 188, 281, 192, 280, 195, 278, 200, 274, 210, 271, 218, 267, 228, 266, 233, 266, 236, 265, 239, 264, 256, 261, 257, 257, 259, 255, 244, 256, 240, 256, 238, 257, 234, 259, 227, 264, 216, 267, 205, 271, 195, 274, 190]], 'area': 593.0, 'bbox': [255, 186, 26, 73], 'iscrowd': 0}
{'id': 29, 'image_id': 61, 'category_id': 2, 'segmentation': [[264, 245, 267, 239, 269, 236, 276, 232, 280, 230, 285, 227, 287, 227, 288, 229, 287, 232, 284, 234, 282, 237, 280, 239, 276, 241, 274, 246, 271, 254, 269, 254, 266, 254, 264, 254]], 'area': 264.0, 'bbox': [264, 227, 24, 27], 'iscrowd': 0}
Find here all you need: Interface for manipulating masks stored in RLE format.
"RLE is a simple yet efficient format for storing binary masks. RLE first divides a vector (or vectorized image) into a series of piecewise constant regions and then for each piece simply stores the length of that piece. For example, given M=[0 0 1 1 1 0 1] the RLE counts would be [2 3 1 1], or for M=[1 1 1 1 1 1 0] the counts would be [0 6 1] (note that the odd counts are always the numbers of zeros). Instead of storing the counts directly, additional compression is achieved with a variable bitrate representation based on a common scheme called LEB128."
So, basically, a mask can be annotated as:
- a polygon in standard coco-json format (x,y,x,y,x,y, etc.),
- a binary mask (a png image),
- an RLE encoded format.
All three represent the same mask, but you sometimes need to convert between them (in case your DL library doesn't support all of them, or doesn't convert them for you).
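As a hedged illustration (assuming pycocotools and OpenCV are installed; the annotation dict is copied from the question), the compressed RLE can be decoded to a binary mask, whose outline can then be traced back into COCO-style polygons:

import numpy as np
import cv2
from pycocotools import mask as mask_utils

# compressed RLE taken from annotation id 24 in the question
rle = {'size': [618, 561],
       'counts': 'of[56Tc00O2O000001O00000OXjP5'.encode('ascii')}  # counts as bytes, as pycocotools expects
binary_mask = np.ascontiguousarray(mask_utils.decode(rle))  # uint8 array of shape (618, 561), values 0/1

# trace the outline to recover polygons in COCO format [x, y, x, y, ...]
contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
segmentation = [c.flatten().tolist() for c in contours if len(c) >= 3]
print(segmentation)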
Python 3.7: I can't convert these bytes to a string
I have this code:
byte = b'\x7f\x9fKL\xaa\xe6\xc8\x8d\xdf865\xf1s\t`R\xd6\xe8\x9c\x07\xae\x97\xe4\x0e\xe6\x08_CZY(1\x94\xca1\x165m\xd6m\x90xs\xc7\x90d\x0c\xe3\xe9;\x9ec\xd3Q\xe6\x11<z\xff:\x97\x9cz\x86{\xdd\x82S\xfc_\xbcow,`i<\xdd\x0f\xe0^\xb12\xdc,\xf5\x08\xdeey\xbb\xf4o\xadx\xc8(\xd0\xab)\xc1\x7f\xbe<z\xderLp\xa0\x02\x0c\x87!+q\x90\xae\x17\xd0\\y04\x1f\xae\xd2x\xc2\x92\xd4\xd5\x04\x9c\x9c\xc7\x0e\xcbxb\x81\xab\xe4w\xf4\xa1\x9f5\xb1p\xf1\xdf\x12^\x00lA\x83\xe1KP\xdb\xa93\x83\x13\x19\xb8\xf7RA\xe8\xe7\xdcU\xfc\xff\xbcJ\x9d\xc2\xba \xd5\xd5>\x15X#=\xf9\xdf\xbe\xee.\xc5\x82c\r\xd6\xad\x88=\xfc\x9f\xf4%+\xf5\ry\xb7\xb2\xabN\x1a\xb5$\xb6\x8b\x7f2sT\x9eo//\xb3\xbe\xdc\xc8\xbc\xc40\xae/P\xef\x1a\x0bP\x96R\xa0p\xe5\x8a\xad\x11\xe5u\xaa\xcbR'
print(str(byte, 'utf-8'))
I want to convert these bytes to a string, write the string to a JSON file, and later convert it back to bytes when I want to use it. But when I try to convert it, I get this error:
Traceback (most recent call last):
  File "wallet.py", line 126, in <module>
    print(str(byte,'utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 1: invalid start byte
You could decode it like this:
>>> byte
b'\x7f\x9fKL\xaa\xe6\xc8\x8d\xdf865\xf1s\tR\xd6\xe8\x9c\x07\xae\x97\xe4\x0e\xe6\x08_CZY(1\x94\xca1\x165m\xd6m\x90xs\xc7\x90d\x0c\xe3\xe9;\x9ec\xd3Q\xe6\x11<z\xff:\x97\x9cz\x86{\xdd\x82S\xfc_\xbcow,i<\xdd\x0f\xe0^\xb12\xdc,\xf5\x08\xdeey\xbb\xf4o\xadx\xc8(\xd0\xab)\xc1\x7f\xbe\x15X#=\xf9\xdf\xbe\xee.\xc5\x82c\r\xd6\xad\x88=\xfc\x9f\xf4%+\xf5\ry\xb7\xb2\xabN\x1a\xb5$\xb6\x8b\x7f2sT\x9eo//\xb3\xbe\xdc\xc8\xbc\xc40\xae/P\xef\x1a\x0bP\x96R\xa0p\xe5\x8a\xad\x11\xe5u\xaa\xcbR'
>>> print(''.join(chr(x) for x in byte))
KLªæÈß865ñs RÖè®ä_CZY(1Ê15mÖmxsÇd y·²«N▒µ$¶2sTo//³¾ÜȼÄ0®/Pï▒ ãé;cÓQæ<zÿ:z{ÝSü_¼ow,i<Ýà^±2Ü,Þey»ôoxÈ(Ы)Á¾X#=ùß¾î.Åc PR pååuªËR
You can see what is going on here:
>>> y = [chr(x) for x in byte]
>>> y
['\x7f', '\x9f', 'K', 'L', 'ª', 'æ', 'È', '\x8d', 'ß', '8', '6', '5', 'ñ', 's', '\t', 'R', 'Ö', 'è', '\x9c', '\x07', '®', '\x97', 'ä', '\x0e', 'æ', '\x08', '_', 'C', 'Z', 'Y', '(', '1', '\x94', 'Ê', '1', '\x16', '5', 'm', 'Ö', 'm', '\x90', 'x', 's', 'Ç', '\x90', 'd', '\x0c', 'ã', 'é', ';', '\x9e', 'c', 'Ó', 'Q', 'æ', '\x11', '<', 'z', 'ÿ', ':', '\x97', '\x9c', 'z', '\x86', '{', 'Ý', '\x82', 'S', 'ü', '_', '¼', 'o', 'w', ',', 'i', '<', 'Ý', '\x0f', 'à', '^', '±', '2', 'Ü', ',', 'õ', '\x08', 'Þ', 'e', 'y', '»', 'ô', 'o', '\xad', 'x', 'È', '(', 'Ð', '«', ')', 'Á', '\x7f', '¾', '\x15', 'X', '#', '=', 'ù', 'ß', '¾', 'î', '.', 'Å', '\x82', 'c', '\r', 'Ö', '\xad', '\x88', '=', 'ü', '\x9f', 'ô', '%', '+', 'õ', '\r', 'y', '·', '²', '«', 'N', '\x1a', 'µ', '$', '¶', '\x8b', '\x7f', '2', 's', 'T', '\x9e', 'o', '/', '/', '³', '¾', 'Ü', 'È', '¼', 'Ä', '0', '®', '/', 'P', 'ï', '\x1a', '\x0b', 'P', '\x96', 'R', '\xa0', 'p', 'å', '\x8a', '\xad', '\x11', 'å', 'u', 'ª', 'Ë', 'R']
>>> [ord(x) for x in y]
[127, 159, 75, 76, 170, 230, 200, 141, 223, 56, 54, 53, 241, 115, 9, 82, 214, 232, 156, 7, 174, 151, 228, 14, 230, 8, 95, 67, 90, 89, 40, 49, 148, 202, 49, 22, 53, 109, 214, 109, 144, 120, 115, 199, 144, 100, 12, 227, 233, 59, 158, 99, 211, 81, 230, 17, 60, 122, 255, 58, 151, 156, 122, 134, 123, 221, 130, 83, 252, 95, 188, 111, 119, 44, 105, 60, 221, 15, 224, 94, 177, 50, 220, 44, 245, 8, 222, 101, 121, 187, 244, 111, 173, 120, 200, 40, 208, 171, 41, 193, 127, 190, 21, 88, 35, 61, 249, 223, 190, 238, 46, 197, 130, 99, 13, 214, 173, 136, 61, 252, 159, 244, 37, 43, 245, 13, 121, 183, 178, 171, 78, 26, 181, 36, 182, 139, 127, 50, 115, 84, 158, 111, 47, 47, 179, 190, 220, 200, 188, 196, 48, 174, 47, 80, 239, 26, 11, 80, 150, 82, 160, 112, 229, 138, 173, 17, 229, 117, 170, 203, 82]
>>> bytes([ord(x) for x in y])
b'\x7f\x9fKL\xaa\xe6\xc8\x8d\xdf865\xf1s\tR\xd6\xe8\x9c\x07\xae\x97\xe4\x0e\xe6\x08_CZY(1\x94\xca1\x165m\xd6m\x90xs\xc7\x90d\x0c\xe3\xe9;\x9ec\xd3Q\xe6\x11<z\xff:\x97\x9cz\x86{\xdd\x82S\xfc_\xbcow,i<\xdd\x0f\xe0^\xb12\xdc,\xf5\x08\xdeey\xbb\xf4o\xadx\xc8(\xd0\xab)\xc1\x7f\xbe\x15X#=\xf9\xdf\xbe\xee.\xc5\x82c\r\xd6\xad\x88=\xfc\x9f\xf4%+\xf5\ry\xb7\xb2\xabN\x1a\xb5$\xb6\x8b\x7f2sT\x9eo//\xb3\xbe\xdc\xc8\xbc\xc40\xae/P\xef\x1a\x0bP\x96R\xa0p\xe5\x8a\xad\x11\xe5u\xaa\xcbR'
>>> len(y)
171
>>> len(byte)
171
Try to use this code:
print(byte.decode('latin-1'))
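Since the goal is to round-trip the bytes through a JSON file, another common approach (a sketch, not from the answers above; the file name and key are illustrative) is to base64-encode them, which yields a plain ASCII string that survives JSON and decodes back losslessly:

import base64
import json

byte = b'\x7f\x9fKL\xaa\xe6'  # shortened example
encoded = base64.b64encode(byte).decode('ascii')   # bytes -> ASCII string
with open('wallet.json', 'w') as f:
    json.dump({'secret': encoded}, f)

with open('wallet.json') as f:
    restored = base64.b64decode(json.load(f)['secret'])
assert restored == byte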
How to get a mapping of country codes to international number prefixes in Python? [duplicate]
I'm interested in getting a mapping of country codes to international phone number prefixes, like so:
{'US': '+1', 'GB': '+44', 'DE': '+49', ...}
One library that probably contains this information is python-phonenumbers. However, after a quick perusal of the source code I wasn't able to find where this information is stored. For example, the shortdata/region_DE.py module looks like this:
"""Auto-generated file, do not edit by hand. DE metadata"""
from ..phonemetadata import NumberFormat, PhoneNumberDesc, PhoneMetadata
PHONE_METADATA_DE = PhoneMetadata(id='DE', country_code=None, international_prefix=None,
    general_desc=PhoneNumberDesc(national_number_pattern='1\\d{2,5}', possible_length=(3, 6)),
    toll_free=PhoneNumberDesc(national_number_pattern='116\\d{3}', example_number='116000', possible_length=(6,)),
    emergency=PhoneNumberDesc(national_number_pattern='11[02]', example_number='112', possible_length=(3,)),
    short_code=PhoneNumberDesc(national_number_pattern='11(?:[025]|6(?:00[06]|1(?:1[17]|23)))', example_number='115', possible_length=(3, 6)),
    short_data=True)
It seems like the country_code and international_prefix fields are None. How can I get such a mapping (possibly with a different library)?
You can get the mapping you want using pycountry and phonenumbers, along with a simple dictionary comprehension:
import phonenumbers as pn
import pycountry

dct = {c.alpha_2: pn.country_code_for_region(c.alpha_2) for c in pycountry.countries}
print(dct)
Output:
{'SK': 421, 'KI': 686, 'LV': 371, 'GH': 233, 'JP': 81, 'SA': 966, 'TD': 235, 'SX': 1, 'CY': 357, 'CH': 41, 'EG': 20, 'PA': 507, 'KP': 850, 'CO': 57, 'GW': 245, 'KG': 996, 'AW': 297, 'FM': 691, 'SB': 677, 'HR': 385, 'PY': 595, 'BG': 359, 'IQ': 964, 'ID': 62, 'GQ': 240, 'CA': 1, 'CG': 242, 'MO': 853, 'SL': 232, 'LA': 856, 'OM': 968, 'MP': 1, 'DK': 45, 'FI': 358, 'DO': 1, 'BM': 1, 'GN': 224, 'NE': 227, 'ER': 291, 'DE': 49, 'UM': 0, 'CM': 237, 'PR': 1, 'RO': 40, 'AZ': 994, 'DZ': 213, 'BW': 267, 'MK': 389, 'HN': 504, 'IS': 354, 'SJ': 47, 'ME': 382, 'NR': 674, 'AD': 376, 'BY': 375, 'RE': 262, 'PG': 675, 'SO': 252, 'NO': 47, 'CC': 61, 'EE': 372, 'BN': 673, 'AU': 61, 'HM': 0, 'ML': 223, 'BD': 880, 'GE': 995, 'US': 1, 'UY': 598, 'SM': 378, 'NG': 234, 'BE': 32, 'KY': 1, 'AR': 54, 'CR': 506, 'VA': 39, 'YE': 967, 'TR': 90, 'CV': 238, 'DM': 1, 'ZM': 260, 'BR': 55, 'MG': 261, 'BL': 590, 'FJ': 679, 'SH': 290, 'KN': 1, 'ZA': 27, 'CF': 236, 'ZW': 263, 'PL': 48, 'SV': 503, 'QA': 974, 'MN': 976, 'SE': 46, 'JE': 44, 'PS': 970, 'MZ': 258, 'TK': 690, 'PM': 508, 'CW': 599, 'HK': 852, 'LB': 961, 'SY': 963, 'LC': 1, 'IE': 353, 'RW': 250, 'NL': 31, 'MA': 212, 'GM': 220, 'IR': 98, 'AT': 43, 'SZ': 268, 'GT': 502, 'MT': 356, 'BQ': 599, 'MX': 52, 'NC': 687, 'CK': 682, 'SI': 386, 'VE': 58, 'IM': 44, 'AM': 374, 'SD': 249, 'LY': 218, 'LI': 423, 'TN': 216, 'UG': 256, 'RU': 7, 'DJ': 253, 'IL': 972, 'TM': 993, 'BF': 226, 'GF': 594, 'TO': 676, 'GI': 350, 'MH': 692, 'UZ': 998, 'PF': 689, 'KZ': 7, 'GA': 241, 'PE': 51, 'TV': 688, 'BT': 975, 'MQ': 596, 'MF': 590, 'AF': 93, 'IN': 91, 'AX': 358, 'BH': 973, 'JM': 1, 'MY': 60, 'BO': 591, 'AI': 1, 'SR': 597, 'ET': 251, 'ES': 34, 'TF': 0, 'GU': 1, 'BJ': 229, 'SS': 211, 'KE': 254, 'BZ': 501, 'IO': 246, 'MU': 230, 'CL': 56, 'MD': 373, 'LU': 352, 'TJ': 992, 'EC': 593, 'VG': 1, 'NZ': 64, 'VU': 678, 'FO': 298, 'LR': 231, 'AL': 355, 'GB': 44, 'AS': 1, 'IT': 39, 'TC': 1, 'TW': 886, 'BI': 257, 'HU': 36, 'TL': 670, 'GG': 44, 'PN': 0, 'SG': 65, 'LS': 266, 'KH': 855, 'FR': 33, 'BV': 0, 'CX': 61, 'AE': 971, 'LT': 370, 'PT': 351, 'KR': 82, 'BB': 1, 'TG': 228, 'AQ': 0, 'EH': 212, 'AG': 1, 'VN': 84, 'CI': 225, 'BS': 1, 'GL': 299, 'MW': 265, 'NU': 683, 'NF': 672, 'LK': 94, 'MS': 1, 'GP': 590, 'NP': 977, 'PW': 680, 'PK': 92, 'WF': 681, 'BA': 387, 'KM': 269, 'JO': 962, 'CU': 53, 'GR': 30, 'YT': 262, 'RS': 381, 'NA': 264, 'ST': 239, 'SC': 248, 'CN': 86, 'CD': 243, 'GS': 0, 'KW': 965, 'MM': 95, 'AO': 244, 'MV': 960, 'UA': 380, 'TT': 1, 'FK': 500, 'WS': 685, 'CZ': 420, 'PH': 63, 'VI': 1, 'TZ': 255, 'MR': 222, 'MC': 377, 'SN': 221, 'HT': 509, 'VC': 1, 'NI': 505, 'GD': 1, 'GY': 592, 'TH': 66}
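If the '+'-prefixed strings shown in the question are wanted instead of bare integers, the same comprehension can be adapted (a small variation on the code above):

dct = {c.alpha_2: '+{}'.format(pn.country_code_for_region(c.alpha_2))
       for c in pycountry.countries}
print(dct['US'])  # '+1'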
I have just found a Python library that should be perfect for your problem. It's called PhoneISO3166; here is the GitHub link: GitHub phoneiso3166