I have a column in my csv file (the second column, called character_position) which represents a list of characters and their positions, as follows.
Each line of this column contains one such list of character positions; overall I have 300 lines, each with its own list.
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
Each character is followed by four values: left, top, right, bottom. For instance, character '1' has left=1890, top=1904, right=486, bottom=505.
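To make the layout concrete, here is a minimal sketch of how one such cell can be unpacked into (char, left, top, right, bottom) tuples, assuming each cell is a valid Python list literal:
import ast

cell = "[['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507]]"
groups = ast.literal_eval(cell)  # nested lists, one per sub-list
for group in groups:
    # slice the flat list into blocks of five values
    records = [tuple(group[i:i + 5]) for i in range(0, len(group), 5)]
    print(records)
# [('1', 1890, 1904, 486, 505), ('8', 1905, 1916, 486, 507)]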
My whole csv file is read as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position'])
From this file I created a new csv file with five columns:
column 1: character, column 2: left, column 3: top, column 4: right, column 5: bottom.
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
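The columns % 5 / columns // 5 pair above builds a MultiIndex on the columns so that stack folds every block of five columns into its own row; a small demo of the idea:
import pandas as pd

# toy frame: two characters flattened into ten columns
demo = pd.DataFrame([["'m'", 38, 104, 2456, 2492, "'i'", 40, 102, 2442, 2448]])
# level 0: position within a block of five; level 1: which block
demo.columns = [demo.columns % 5, demo.columns // 5]
demo = demo.stack().reset_index(drop=True)
demo.columns = ['char', 'left', 'top', 'right', 'bottom']
print(demo)
# yields two rows: ('m', 38, 104, 2456, 2492) and ('i', 40, 102, 2442, 2448)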
I want to add 2 other columns, called line_number and all_chars_in_same_row:
1) line_number corresponds to the csv line from which, for example, 'm' 38 104 2456 2492 was extracted, say line 2.
2) all_chars_in_same_row corresponds to all the (space-separated) characters which are in the same row. For instance, from
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
I get '1' '8' '4' '1' '7' and so on.
More formally:
all_chars_in_same_row means: write all the characters of the given row into the all_chars_in_same_row column.
char left top right bottom line_number all_chars_in_same_row
0 'm' 38 104 2456 2492 from line 2 'm' '2' '5' 'g'
1 'i' 40 102 2442 222 from line 4
2 '.' 203 213 191 198 from line 6
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
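In other words, a sketch of the rule I am after (reusing the reshape from above, and assuming no empty [] groups; those are dealt with in EDIT2 below):
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack()
df1.columns = ['char', 'left', 'top', 'right', 'bottom']
# the first index level is the csv line the record came from
df1['line_number'] = df1.index.get_level_values(0)
df1 = df1.reset_index(drop=True)
# space-separated characters of every record sharing the same csv line
df1['all_chars_in_same_row'] = (
    df1.groupby('line_number')['char'].transform(lambda s: ' '.join(s)))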
EDIT1:
import pandas as pd
df_data=pd.read_csv('/home/ahmed/internship/cnn_ocr/list_characters.csv')
df_data.shape
(50, 3)
df_data.iloc[:, 1]
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11 [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12 [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13 [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14 [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15 [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16 [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17 [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18 [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19 [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20 [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21 [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22 [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23 [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24 [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25 [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26 [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27 [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28 [['S', 2369, 2382, 1833, 1866]]
29 [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
30 [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
31 [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
32 [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
33 [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
34 [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
35 [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
36 [['c', 88, 118, 2872, 2902]]
37 [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
38 [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
39 [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
40 [[]]
41 [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
42 [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
43 [[]]
44 [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
45 [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
46 [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
47 [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
48 [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
49 [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
Name: character_position, dtype: object
Then, to build my chars.csv, I do the following:
df = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any leftover brackets from every column in one pass
df1 = df1.replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('chars.csv')
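As an aside, a sketch that avoids the bracket-stripping regexes altogether by parsing each cell as a Python literal (assuming every cell is valid list syntax and the file has no header row):
import ast
import pandas as pd

df = pd.read_csv('list_characters.csv', header=None, usecols=[1],
                 names=['character_position'])
parsed = df.character_position.apply(ast.literal_eval)  # real nested lists
rows = [group[i:i + 5]            # blocks of five values
        for groups in parsed      # each csv line
        for group in groups       # each sub-list of that line
        for i in range(0, len(group), 5)]
df1 = pd.DataFrame(rows, columns=['char', 'left', 'right', 'top', 'bottom'])
df1.to_csv('chars.csv')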
However, I don't see in your response how you added the columns FromLine and all_chars_in_same_row.
When I execute your line of code:
df_data = df_data.character_position.str.strip('[]').str.split(',', expand=True)
I get the following:
df_data[0:10]
0 1 2 3 4 5 6 7 8 9 ... \
0 'm' 38 104 2456 2492 'i' 40 102 2442 2448 ...
1 '.' 203 213 191 198 '3' 235 262 131 198 ...
2 'A' 275 347 147 239 'M' 363 465 145 239 ...
3 'A' 73 91 373 394 'D' 93 112 373 396 ...
4 'D' 454 473 663 685 'O' 474 495 664 687 ...
5 'A' 108 129 727 751 'V' 129 150 727 753 ...
6 'N' 34 51 949 970 '/' 52 61 948 970 ...
7 'S' 1368 1401 43 85 'A' 1406 1446 43 85 ...
8 'S' 1437 1457 112 138 'o' 1458 1476 118 138 ...
9 'h' 1686 1703 315 339 't' 1706 1715 316 339 ...
1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
0 None None None None None None None None None None
1 None None None None None None None None None None
2 None None None None None None None None None None
3 None None None None None None None None None None
4 None None None None None None None None None None
5 None None None None None None None None None None
6 None None None None None None None None None None
Here are the first 10 lines of my csv file:
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 687, 'C', 498, 516, 664, 687, 'U', 519, 536, 663, 687, 'M', 540, 561, 663, 687, 'E', 564, 581, 663, 685, 'N', 584, 600, 664, 685, 'T', 603, 618, 663, 685]]
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 753, 'O', 153, 175, 727, 753, 'I', 178, 183, 727, 751, 'R', 187, 210, 727, 751, 'S', 220, 240, 727, 753, 'U', 243, 263, 727, 753, 'R', 267, 288, 727, 751, 'F', 302, 318, 727, 751, 'A', 320, 341, 727, 751, 'C', 342, 363, 726, 751, 'T', 366, 384, 726, 750, 'U', 387, 407, 727, 751, 'R', 411, 432, 727, 751, 'E', 435, 453, 726, 751, 'P', 797, 815, 727, 751, 'A', 818, 839, 727, 751, 'G', 840, 863, 727, 751, 'E', 867, 885, 726, 751, '1', 900, 911, 727, 751, '1', 926, 934, 727, 751, '1', 947, 956, 727, 751, '5', 962, 979, 727, 751], ['R', 120, 142, 778, 807, 'T', 144, 165, 778, 805, 'T', 178, 199, 778, 805, 'e', 201, 219, 786, 807, 'c', 222, 240, 786, 807, 'h', 241, 258, 778, 807, 'n', 263, 279, 786, 807, 'i', 284, 287, 778, 805, 'c', 291, 308, 786, 807, 'a', 309, 327, 786, 807, 'R', 350, 374, 778, 807, 'e', 377, 395, 786, 807, 't', 396, 405, 780, 805, 'u', 408, 425, 786, 807, 'r', 429, 440, 786, 807, 'n', 441, 458, 786, 807, '-', 471, 482, 793, 798, 'D', 497, 518, 778, 807, 'O', 522, 548, 777, 807, 'A', 549, 573, 778, 807, '/', 585, 596, 778, 807, 'D', 606, 630, 778, 807, 'A', 632, 656, 778, 807, 'P', 659, 680, 778, 805]]
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970, 'C', 63, 81, 948, 970, 'O', 84, 103, 948, 970, 'M', 106, 127, 949, 970, 'M', 130, 151, 948, 970, 'A', 153, 172, 949, 970, 'N', 175, 192, 949, 970, 'D', 195, 213, 948, 970, 'E', 217, 232, 948, 970], ['1', 73, 84, 993, 1020, '1', 94, 105, 993, 1020, '8', 112, 130, 991, 1020, '4', 135, 153, 993, 1018, '5', 156, 172, 994, 1018, '7', 175, 192, 993, 1018, '6', 195, 213, 993, 1020, '0', 216, 235, 991, 1020, '6', 238, 257, 993, 1020, '5', 260, 278, 993, 1020, '0', 407, 425, 991, 1020, '9', 428, 446, 991, 1020, '.', 450, 455, 1015, 1020, '0', 459, 477, 991, 1020, '1', 485, 494, 994, 1018, '.', 503, 507, 1015, 1020, '2', 512, 530, 991, 1020, '0', 533, 551, 991, 1020, '1', 555, 566, 993, 1020, '5', 575, 593, 993, 1020, 'R', 632, 656, 991, 1020, 'M', 659, 684, 991, 1020, 'A', 689, 713, 991, 1020, 'N', 726, 747, 993, 1020, 'o', 752, 770, 999, 1020, '.', 774, 779, 1015, 1020, '5', 794, 812, 993, 1020, '8', 815, 833, 991, 1020, '4', 834, 852, 993, 1017, '4', 857, 873, 994, 1018, '3', 878, 896, 991, 1020, '8', 899, 917, 991, 1020, '0', 920, 938, 991, 1020, '/', 950, 960, 991, 1020, '0', 971, 990, 993, 1020, '7', 995, 1011, 993, 1018, '1', 1016, 1026, 993, 1018, '6', 1034, 1052, 993, 1020, '7', 1055, 1073, 993, 1020, '4', 1076, 1094, 993, 1018, '8', 1098, 1116, 991, 1020, '9', 1119, 1137, 991, 1020, '0', 1140, 1158, 993, 1020, '9', 1160, 1178, 991, 1020], ['N', 34, 51, 1045, 1066, '/', 54, 61, 1045, 1066, 'B', 63, 79, 1044, 1066, 'O', 82, 102, 1044, 1066, 'N', 105, 121, 1045, 1066, 'D', 133, 151, 1045, 1066, 'E', 156, 172, 1044, 1066, 'L', 183, 196, 1045, 1066, 'I', 199, 204, 1045, 1066, 'V', 205, 223, 1045, 1066, 'R', 226, 244, 1045, 1066, 'A', 246, 266, 1045, 1066, 'I', 267, 272, 1045, 1066, 'S', 275, 291, 1044, 1066, 'O', 294, 314, 1045, 1066, 'N', 318, 335, 1045, 1066], ['8', 72, 90, 1093, 1122, '2', 93, 109, 1093, 1122, '5', 114, 132, 1095, 1122, '9', 135, 153, 1093, 1122, '7', 154, 172, 1095, 1122, '1', 178, 189, 1093, 1122, '3', 196, 214, 1093, 1122, '1', 220, 231, 1095, 1122, '0', 238, 257, 1093, 1122, '3', 260, 278, 1093, 1122, '0', 407, 425, 1093, 1122, '6', 429, 447, 1095, 1122, '.', 452, 455, 1117, 1122, '0', 459, 477, 1093, 1122, '2', 480, 498, 1093, 1122, '.', 503, 507, 1117, 1122, '2', 512, 530, 1093, 1122, '0', 533, 551, 1093, 1122, '1', 557, 567, 1095, 1122, '5', 575, 593, 1095, 1122], ['v', 70, 90, 1150, 1171, '/', 88, 97, 1150, 1171, 'r', 100, 118, 1150, 1171, 'é', 121, 136, 1144, 1173, 'f', 141, 156, 1150, 1171, 'ê', 159, 174, 1144, 1173, 'r', 177, 195, 1150, 1173, 'e', 198, 214, 1150, 1171, 'n', 217, 234, 1150, 1171, 'c', 238, 257, 1149, 1171, 'e', 260, 276, 1149, 1173, 'B', 476, 497, 1152, 1179, 'O', 501, 527, 1149, 1179, 'G', 530, 555, 1150, 1180, 'D', 560, 582, 1152, 1179, 'O', 585, 611, 1149, 1179, 'A', 614, 638, 1150, 1179, '1', 642, 653, 1152, 1179, '5', 659, 677, 1153, 1180, 'B', 681, 701, 1152, 1179, 'T', 705, 726, 1152, 1179, '0', 728, 746, 1152, 1179, '6', 749, 767, 1152, 1179]]
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43, 85, 'M', 1451, 1491, 36, 85, 'S', 1500, 1533, 43, 85, 'U', 1539, 1574, 43, 85, 'N', 1581, 1616, 43, 85, 'G', 1623, 1662, 42, 85, 'E', 1686, 1719, 43, 85, 'L', 1725, 1755, 43, 85, 'E', 1763, 1794, 42, 85, 'C', 1800, 1836, 43, 85, 'T', 1841, 1874, 42, 85, 'R', 1880, 1914, 42, 84, 'O', 1919, 1959, 42, 85, 'N', 1965, 1998, 42, 84, 'I', 2007, 2016, 42, 84, 'C', 2022, 2058, 42, 84, 'S', 2066, 2099, 42, 84, 'F', 2121, 2151, 42, 84, 'R', 2159, 2193, 42, 84, 'A', 2198, 2237, 40, 84, 'N', 2243, 2277, 40, 84, 'C', 2285, 2321, 42, 84, 'E', 2328, 2360, 40, 84]]
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, 118, 138, 'c', 1479, 1493, 120, 138, 'i', 1494, 1499, 112, 136, 'é', 1503, 1518, 114, 138, 't', 1520, 1527, 115, 138, 'é', 1530, 1547, 112, 138, 'p', 1559, 1575, 120, 144, 'a', 1577, 1593, 118, 138, 'r', 1596, 1607, 118, 136, 'A', 1616, 1637, 112, 136, 'c', 1640, 1653, 118, 138, 't', 1655, 1664, 115, 136, 'i', 1665, 1670, 112, 136, 'o', 1673, 1688, 118, 138, 'n', 1692, 1707, 118, 136, 's', 1710, 1725, 118, 138, 'S', 1736, 1755, 112, 138, 'i', 1760, 1763, 112, 136, 'm', 1767, 1791, 118, 136, 'p', 1794, 1811, 118, 142, 'l', 1812, 1817, 112, 136, 'i', 1821, 1824, 112, 136, 'f', 1827, 1835, 112, 136, 'i', 1835, 1841, 112, 136, 'é', 1845, 1860, 112, 136, 'e', 1863, 1878, 118, 136, 'a', 1890, 1907, 118, 138, 'u', 1910, 1925, 118, 136, 'C', 1937, 1958, 112, 136, 'a', 1961, 1977, 118, 136, 'p', 1980, 1995, 118, 142, 'i', 1998, 2003, 112, 136, 't', 2006, 2013, 114, 136, 'a', 2015, 2030, 118, 136, 'l', 2034, 2037, 112, 136, 'd', 2051, 2066, 111, 136, 'e', 2069, 2085, 117, 136, '2', 2097, 2112, 112, 136, '7', 2115, 2132, 111, 136, '.', 2136, 2139, 132, 136, '0', 2144, 2159, 111, 136, '0', 2162, 2178, 111, 136, '0', 2180, 2196, 111, 136, '.', 2201, 2205, 132, 135, '0', 2208, 2225, 111, 136, '0', 2228, 2243, 111, 136, '0', 2246, 2261, 111, 136, 't', 2273, 2281, 112, 135, 'i', 2281, 2291, 111, 136], ['1', 1473, 1482, 153, 177, ',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 177, 'u', 1520, 1535, 160, 177, 'e', 1538, 1554, 159, 177, 'F', 1566, 1583, 153, 177, 'r', 1587, 1596, 159, 177, 'u', 1598, 1613, 159, 177, 'c', 1617, 1631, 159, 177, 't', 1634, 1641, 154, 177, 'i', 1643, 1646, 153, 177, 'd', 1650, 1665, 151, 177, 'o', 1668, 1685, 159, 177, 'r', 1688, 1697, 159, 177, 'C', 1709, 1730, 153, 177, 'S', 1733, 1751, 153, 177, '2', 1764, 1779, 153, 177, '0', 1781, 1797, 153, 177, '0', 1800, 1817, 153, 177, '3', 1820, 1835, 151, 177, '9', 1847, 1863, 151, 177, '3', 1866, 1883, 151, 177, '4', 1883, 1901, 153, 175, '8', 1904, 1919, 151, 177, '4', 1919, 1937, 153, 175, 'S', 1950, 1968, 151, 177, 'A', 1971, 1992, 151, 175, 'I', 1995, 2000, 151, 175, 'N', 2004, 2024, 151, 175, 'T', 2027, 2046, 151, 175, 'O', 2058, 2081, 151, 177, 'U', 2085, 2105, 151, 177, 'E', 2109, 2127, 151, 177, 'N', 2130, 2150, 151, 175, 'C', 2163, 2186, 151, 175, 'e', 2187, 2204, 157, 175, 'd', 2207, 2222, 150, 175, 'e', 2225, 2240, 157, 175, 'x', 2243, 2258, 157, 175], ['T', 1638, 1656, 192, 216, 'É', 1659, 1677, 186, 217, 'L', 1682, 1697, 193, 217, 'É', 1701, 1719, 187, 217, 'P', 1722, 1742, 192, 217, 'H', 1746, 1766, 193, 217, 'O', 1770, 1793, 192, 217, 'N', 1796, 1815, 192, 216, 'E', 1820, 1838, 192, 217, '0', 1869, 1886, 190, 216, '1', 1890, 1899, 192, 216, '4', 1914, 1931, 193, 216, '4', 1934, 1950, 193, 216, '0', 1961, 1977, 190, 216, '4', 1980, 1997, 193, 216, '7', 2009, 2024, 192, 216, '0', 2027, 2042, 192, 216, '0', 2055, 2070, 192, 216, '0', 2073, 2090, 192, 216], ['R', 1517, 1538, 232, 258, '.', 1542, 1545, 253, 256, 'C', 1550, 1571, 232, 256, '.', 1575, 1580, 252, 256, 'S', 1584, 1602, 232, 256, '.', 1607, 1611, 252, 256, 'B', 1625, 1643, 232, 256, 'O', 1649, 1670, 231, 258, 'B', 1674, 1692, 232, 256, 'I', 1697, 1701, 232, 256, 'G', 1706, 1728, 232, 256, 'N', 1731, 1751, 232, 256, 'Y', 1754, 1775, 232, 256, 'B', 1788, 1806, 232, 256, '3', 1818, 1835, 231, 256, '3', 1838, 1855, 231, 256, '4', 1855, 1872, 232, 255, '3', 1884, 1899, 232, 256, '6', 1904, 1919, 232, 256, '7', 1922, 1937, 232, 256, '4', 1947, 1964, 232, 256, '9', 1967, 1983, 232, 256, '7', 1986, 2001, 232, 256, '-', 
2013, 2022, 244, 249, 'A', 2034, 2055, 231, 255, 'P', 2057, 2075, 231, 255, 'E', 2079, 2097, 231, 256, '4', 2109, 2126, 232, 255, '6', 2129, 2145, 232, 256, '5', 2148, 2163, 232, 256, '2', 2166, 2183, 232, 255, 'Z', 2193, 2211, 231, 255], ['C', 1628, 1647, 271, 297, 'o', 1652, 1670, 279, 297, 'd', 1671, 1689, 273, 297, 'e', 1692, 1709, 279, 298, 'T', 1721, 1739, 273, 297, 'V', 1742, 1763, 273, 297, 'A', 1763, 1787, 273, 297, 'F', 1818, 1835, 273, 297, 'R', 1839, 1859, 273, 297, '8', 1872, 1889, 273, 297, '9', 1890, 1905, 273, 297, '3', 1919, 1932, 273, 297, '3', 1937, 1952, 273, 297, '4', 1953, 1971, 273, 297, '3', 1983, 1998, 273, 297, '6', 2001, 2018, 273, 297, '7', 2021, 2036, 273, 295, '4', 2048, 2064, 274, 297, '9', 2066, 2082, 273, 297, '7', 2085, 2100, 273, 295]]
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, 316, 339, 't', 1718, 1727, 316, 339, 'p', 1730, 1748, 321, 345, 'i', 1751, 1757, 321, 339, 'f', 1760, 1769, 315, 339, '/', 1769, 1776, 313, 339, 'w', 1779, 1804, 321, 337, 'w', 1804, 1829, 321, 339, 'w', 1830, 1854, 321, 337, '.', 1859, 1863, 333, 337, 's', 1868, 1883, 319, 339, 'a', 1886, 1901, 321, 337, 'm', 1905, 1929, 321, 337, 's', 1932, 1949, 321, 339, 'u', 1953, 1968, 321, 339, 'n', 1973, 1989, 321, 339, 'g', 1992, 2010, 319, 345, '.', 2015, 2019, 333, 337, 'f', 2021, 2033, 313, 337, 'r', 2034, 2045, 319, 337]]
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, 370, 393, 'O', 1382, 1403, 370, 393, 'M', 1404, 1425, 370, 391, 'P', 1430, 1446, 370, 391, 'T', 1448, 1464, 370, 391, 'E', 1467, 1484, 370, 393, 'C', 1494, 1512, 370, 393, 'L', 1515, 1532, 370, 393, 'I', 1533, 1539, 370, 393, 'E', 1542, 1559, 370, 393, 'N', 1560, 1580, 370, 393, 'T', 1580, 1598, 370, 393]]
Here is the second csv file:
char left right top bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 2448
2 'i' 40 100 2402 2410
3 'l' 40 102 2372 2382
4 'm' 40 102 2312 2358
5 'u' 40 102 2292 2310
6 'i' 40 104 2210 2260
7 'l' 40 104 2180 2208
8 'i' 40 104 2140 2166
EDIT1
Here is my output for Solution 2 (with the `character_position` input described above):
1831 1830 level_2 char left top right bottom FromLine all_chars_in_same_row
0 0 character_position 0 character_position 0 character_position
1 1 'm','i','i','l','m','u','i','l','i','l' 0 'm' 38 104 2456 2492 1 'm','i','i','l','m','u','i','l','i','l'
2 1 'm','i','i','l','m','u','i','l','i','l' 1 'i' 40 102 2442 2448 1 'm','i','i','l','m','u','i','l','i','l'
3 1 'm','i','i','l','m','u','i','l','i','l' 2 'i' 40 100 2402 2410 1 'm','i','i','l','m','u','i','l','i','l'
I think the problem comes from the fact that I have in my data:
[[',' , 'A', ',' , '.', ':' , ';', '1'], [], ['m', 'a',]]
so the empty `[]` lists break the column order. I noticed this when I tried to omit all the empty `[]`, because I found values in my csv such as `['a` rather than `'a'` in the char column, `8794]` rather than `8794`, or `[5345` rather than `5345`.
So I processed the csv as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1,3], names=['character_position','LineIndex'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any leftover brackets from every column in one pass
df1 = df1.replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('char.csv')
Then I noticed the following: at line 1221, column B is empty (it replaced an empty `[]`), and the columns get switched (B and C) because of the empty char. How can I solve that?
I also have empty lines:
3831 '6' 296 314 3204 3231
3832
3833 '1' 480 492 3202 3229
Line 3832 should be removed, in order to get a clean table like the output further below.
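A way to drop such fully blank rows (a sketch, assuming the blanks survive as empty strings after the bracket stripping):
import numpy as np

# turn empty strings into NaN, then drop rows that are entirely blank
df1 = df1.replace(r'^\s*$', np.nan, regex=True)
df1 = df1.dropna(how='all')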
EDIT2:
In order to solve the problem of empty rows and empty `[]` in list_characters.csv, such as
[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
and
[[]] [[]]
I did the following:
import numpy as np

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace([r'\[\],', r'\[\[\]\]', ''], ['', '', np.nan], regex=True)
df1 = df1.dropna()
Then:
import ast

df = pd.read_csv('character_position.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
page_number positionlrtb \
0 1841729699_001 [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...
1 1841729699_001 [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]
2 1841729699_001 [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...
3 1841729699_001 [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...
4 1841729699_001 [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...
LineIndex
0 [[mi, il, mu, il, il]]
1 [[.3]]
2 [[amsun]]
3 [[adresse, de, livraison]]
4 [[document]]
import numpy as np
from itertools import chain

cols = ['char','left','top','right','bottom']
df1 = pd.DataFrame({
    "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
    "b": list(chain.from_iterable(df.positionlrtb))})
df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print(df1)
char left top right bottom
0 m 38 104 2456 2492
1 i 40 102 2442 2448
2 i 40 100 2402 2410
3 l 40 102 2372 2382
4 m 40 102 2312 2358
5 u 40 102 2292 2310
6 i 40 104 2210 2260
7 l 40 104 2180 2208
8 i 40 104 2140 2166
However:
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
returns None values (with expand=True, shorter rows are padded with None up to the length of the longest row).
Once you have created the required data frame, after stacking, don't drop the index: it holds your line number. Since this is a MultiIndex, get the first level - that is your line number.
df_data['LineIndex'] = df_data.index.get_level_values(0)
Then you can group by the LineIndex column and get all the characters for a common LineIndex. This gives a dictionary; convert this dictionary into a data frame and finally merge it into the actual data.
Solution 1
import pandas as pd
df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()  # don't reset the index, it holds the line each record came from
print(df_data)
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign line number to a column
cols = ['char','left','top','right','bottom','FromLine']
df_data.columns = cols  # assign the new column names
# create a new dictionary:
# it maps each line number (key) to all the characters from that line (value)
DictChar = {k: list(v) for k, v in df_data.groupby("FromLine")["char"]}
# convert the dictionary into a dataframe
df_chars = pd.DataFrame(list(DictChar.items()))
df_chars.columns = ['FromLine', 'char']
# merge the dataframes on column 'FromLine'
df_final = df_data.merge(df_chars, on='FromLine')
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_final.columns = cols
print(df_final)
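Note that the dictionary-and-merge step can also be collapsed into a single groupby().transform, which broadcasts the grouped result straight back onto the frame (a compact variant of the same idea; here the characters come back as one comma-joined string instead of a list):
# equivalent to the dictionary + merge above, in one step
df_data['all_chars_in_same_row'] = (
    df_data.groupby('FromLine')['char'].transform(lambda s: ', '.join(s)))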
Solution 2
I personally prefer this solution over the first one. See the inline comments for more details.
import pandas as pd
df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
x = len(df_data.columns)  # get the total number of columns
# get the character from every 5th column, concatenate them, and create a new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda row: ','.join(row.dropna()), axis=1)
# the index of each row is the line number of your record
df_data[x+1] = df_data.index.get_level_values(0)
# now set the line number and the concatenated characters as the index of the data frame
df_data.set_index([x+1, x], inplace=True, drop=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign the line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1)  # assign the concatenated characters to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns = cols
df_data.reset_index(drop=True, inplace=True)  # remove the multiindexing
print(df_data[cols])
Output
char left top right bottom FromLine all_chars_in_same_row
0 '.' 203 213 191 198 0 ['.', '3', 'C']
1 '3' 1758 1775 370 391 0 ['.', '3', 'C']
2 'C' 296 305 1492 1516 0 ['.', '3', 'C']
3 'A' 275 347 147 239 1 ['A', 'M', 'D']
4 'M' 2166 2184 370 391 1 ['A', 'M', 'D']
5 'D' 339 362 1815 1840 1 ['A', 'M', 'D']
6 'A' 73 91 373 394 2 ['A', 'D', 'A']
7 'D' 1395 1415 427 454 2 ['A', 'D', 'A']
8 'A' 1440 1455 2047 2073 2 ['A', 'D', 'A']
9 'D' 454 473 663 685 3 ['D', 'O', '0']
10 'O' 1533 1545 487 541 3 ['D', 'O', '0']
11 '0' 339 360 2137 2163 3 ['D', 'O', '0']
12 'A' 108 129 727 751 4 ['A', 'V', 'I']
13 'V' 1659 1677 490 514 4 ['A', 'V', 'I']
14 'I' 339 360 1860 1885 4 ['A', 'V', 'I']
15 'N' 34 51 949 970 5 ['N', '/', '2']
16 '/' 1890 1904 486 505 5 ['N', '/', '2']
17 '2' 1266 1283 1951 1977 5 ['N', '/', '2']
18 'S' 1368 1401 43 85 6 ['S', 'A', '8']
19 'A' 1344 1361 583 607 6 ['S', 'A', '8']
20 '8' 2207 2217 1492 1515 6 ['S', 'A', '8']
21 'S' 1437 1457 112 138 7 ['S', 'o', 'O']
22 'o' 1548 1580 979 1015 7 ['S', 'o', 'O']
23 'O' 1331 1349 370 391 7 ['S', 'o', 'O']
24 'h' 1686 1703 315 339 8 ['h', 't', 't']
25 't' 169 190 1291 1312 8 ['h', 't', 't']
26 't' 169 190 1291 1312 8 ['h', 't', 't']
27 'N' 1331 1349 370 391 9 ['N', 'C', 'C']
28 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
29 'C' 296 305 1492 1516 9 ['N', 'C', 'C']