Accessing pandas DataFrame as a nested list - python

I've used the .set_index() function to set the first column as my index to the rows in my dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([['x', 1,2,3,4,5], ['y', 6,7,8,9,10], ['z', 11,12,13,14,15]])
>>> df.columns = ['index', 'a', 'b', 'c', 'd', 'e']
>>> df = df.set_index(['index'])
>>> df
        a   b   c   d   e
index
x       1   2   3   4   5
y       6   7   8   9  10
z      11  12  13  14  15
How should the dataframe be manipulated so that it can be accessed like a nested list? For example, are the following possible:
>>> df['x']
[1, 2, 3, 4, 5]
>>> df['x']['a']
1
>>> df['x']['a', 'b']
(1, 2)
>>> df['x']['a', 'd', 'c']
(1, 4, 3)
I've tried accessing df['x'] after setting the index, but it throws an error. Is that the correct way to access the x row?
>>> df['x']
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2393, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
KeyError: 'x'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2062, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.5/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.5/site-packages/pandas/core/internals.py", line 3590, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
KeyError: 'x'

You can use loc.
From your examples:
df['x'] should be df.loc['x']
df['x']['a'] should be df.loc['x', 'a'], and
df['x']['a', 'd', 'c'] should be df.loc['x', ['a', 'd', 'c']]

Rows should be accessed using loc:
df.loc['x']
Getting a row together with specific columns should be
df.loc['x', ['a', 'c']]
or you can transpose the frame and access the row as a column:
df.T.x
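A minimal sketch pulling the loc suggestions above together for the question's dataframe. The conversions to plain Python values (tolist, int, tuple) and the to_dict(orient='index') call are additions that do not appear in the answers; they are shown only as one possible way to get list-like or nested-dict style access.

```python
import pandas as pd

# Same data as in the question, with the first column used as the row index.
df = pd.DataFrame(
    [['x', 1, 2, 3, 4, 5], ['y', 6, 7, 8, 9, 10], ['z', 11, 12, 13, 14, 15]],
    columns=['index', 'a', 'b', 'c', 'd', 'e'],
).set_index('index')

row_x = df.loc['x'].tolist()                          # whole row as a list
cell = int(df.loc['x', 'a'])                          # single value
pair = tuple(df.loc['x', ['a', 'b']].tolist())        # selected columns, in order
triple = tuple(df.loc['x', ['a', 'd', 'c']].tolist())

# If nested-dict style access is really wanted, convert once:
nested = df.to_dict(orient='index')                   # nested['x']['a']

print(row_x, cell, pair, triple, nested['x']['a'])
```

df['x'] fails in the question because plain square brackets on a DataFrame look up column labels, and 'x' is a row label.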

Related

Find the product of the minimum height of defenders lower than 180 cm and the maximum height of midfielders higher than 185 cm

positions = ['GK', 'M', 'A', 'D', 'M', 'D', 'M', 'M', 'M', 'A', 'M', 'M', 'A', 'A', 'A', 'M', 'D', 'A', 'D', 'M', 'GK', 'D', 'D', 'M', 'M', 'M', 'M', 'D', 'M', 'GK', 'D', 'GK', 'D', 'D', 'M']
heights = [191, 184, 185, 180, 181, 187, 170, 179, 183, 186, 185, 170, 187, 183, 173, 188, 183, 180, 188, 175, 193, 180, 185, 170, 183, 173, 185, 185, 168, 190, 178, 185, 185, 193, 183]
import numpy as np
np_positions = np.array(positions)
np_heights = np.array(heights)
My code is:
print(np.min(positions=='D'[heights<180]) * np.max(positions=='A'[heights > 185]))
I get a TypeError. I solved it another way, but I need to do this in one line.
IIUC, use your numpy arrays and boolean slicing!
NB. assuming Defender is D and Midfielder is M.
# position is D/M condition on height
(np_heights[(np_positions=='D')&(np_heights<180)].min()
*np_heights[(np_positions=='M')&(np_heights>185)].max()
)
output: 33464 (178*188)
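For reference, a self-contained sketch of the boolean-mask approach with the masks named separately. It assumes, as the answer does, that 'D' means defender and 'M' means midfielder; the mask names are illustrative. The original attempt raises a TypeError because it compares plain Python lists with an integer (heights<180) instead of using the NumPy arrays.

```python
import numpy as np

positions = ['GK', 'M', 'A', 'D', 'M', 'D', 'M', 'M', 'M', 'A', 'M', 'M',
             'A', 'A', 'A', 'M', 'D', 'A', 'D', 'M', 'GK', 'D', 'D', 'M',
             'M', 'M', 'M', 'D', 'M', 'GK', 'D', 'GK', 'D', 'D', 'M']
heights = [191, 184, 185, 180, 181, 187, 170, 179, 183, 186, 185, 170,
           187, 183, 173, 188, 183, 180, 188, 175, 193, 180, 185, 170,
           183, 173, 185, 185, 168, 190, 178, 185, 185, 193, 183]

np_positions = np.array(positions)
np_heights = np.array(heights)

# Element-wise conditions combined with & (parentheses are required).
short_defenders = (np_positions == 'D') & (np_heights < 180)
tall_midfielders = (np_positions == 'M') & (np_heights > 185)

product = np_heights[short_defenders].min() * np_heights[tall_midfielders].max()
print(product)  # 33464, i.e. 178 * 188
```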
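What you could do is create a list of (position, height) pairs using zip.
Python's min and max functions accept a key argument that defines the value to compare on. Here I used a lambda with an if/else expression that returns a sentinel (1e9 or -1e9) for players who do not meet the condition, so they can never be selected. The code follows the question in using 'A' for the second group; replace it with 'M' if midfielders are coded as M.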
positions = ['GK', 'M', 'A', 'D', 'M', 'D', 'M', 'M', 'M', 'A', 'M', 'M', 'A', 'A', 'A', 'M', 'D', 'A', 'D', 'M', 'GK', 'D', 'D', 'M', 'M', 'M', 'M', 'D', 'M', 'GK', 'D', 'GK', 'D', 'D', 'M']
heights = [191, 184, 185, 180, 181, 187, 170, 179, 183, 186, 185, 170, 187, 183, 173, 188, 183, 180, 188, 175, 193, 180, 185, 170, 183, 173, 185, 185, 168, 190, 178, 185, 185, 193, 183]
# create a 2d array:
d = list(zip(positions, heights))
min_height_defender = min(d, key=lambda x: x[1] if x[0] == 'D' and x[1] < 180 else 1e9)
max_height_midfielder = max(d, key=lambda x: x[1] if x[0] == 'A' and x[1] > 185 else -1e9)
print( min_height_defender[1] * max_height_midfielder[1] )
>>> 33286
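Or you can use numpy, but I would suggest filtering first and taking the minimum and maximum in separate steps; otherwise your code becomes unreadable. Note that the first two variants below filter on position only and skip the height thresholds; they give the same result here only because the shortest defender (178) is already under 180 and the tallest 'A' player (187) is already over 185: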
# or with numpy:
positions = np.array(positions)
heights = np.array(heights)
print( heights[np.argwhere(positions == 'D')].min() * heights[np.argwhere(positions == 'A')].max() )
>>> 33286
# or without argwhere:
print( heights[positions == 'D'].min() * heights[positions == 'A'].max() )
>>> 33286
If you want to filter inline you can, but as I said it is very ugly (in my opinion), and you should avoid doing more than one thing per line of code. But if you want it just for the kick of doing things in one line:
print( heights[heights<180][positions[heights<180] == 'D'].min() * heights[heights>185][positions[heights>185] == 'A'].max() )
>>> 33286

Pandas DF to a nested list

I have a dataframe (df) and I want to transform it to a nested list.
df=pd.DataFrame({'Number':[1,2,3,4,5, 6],
'Name':['A', 'B', 'C', 'D', 'E', 'F'],
'Value': [223, 124, 434, 514, 821, 110]})
My expected outcome is a nested list. The first inner list takes the values of the first 3 rows of the first column, the second takes the first 3 rows of the second column, and the third takes the first 3 rows of the third column. After that I want to add the lists for the remaining 3 rows.
[[1, 2, 3],
['A', 'B', 'C'],
[223, 124, 434],
[4, 5, 6],
['D', 'E', 'F'],
[514, 821, 110]]
I did a for loop and called tolist() on each series. Then I get all the values from one column in a list. How do I go from the outcome below to the expected outcome above?
col=df.columns
lst=[]
for i in col:
    temp = df[i].tolist()
    temp
    lst.append(temp)
Outcome (lst):
[[1, 2, 3, 4, 5, 6],
['A', 'B', 'C', 'D', 'E', 'F'],
[223, 124, 434, 514, 821, 110]]
Use .values and some numpy slicing
v = df.values.T
v[:,:3].tolist() + v[:,3:].tolist()
output
[[1, 2, 3],
['A', 'B', 'C'],
[223, 124, 434],
[4, 5, 6],
['D', 'E', 'F'],
[514, 821, 110]]
Try:
lst = df.set_index(df.index // 3).groupby(level=0).agg(list) \
.to_numpy().ravel().tolist()
print(lst)
# Output
[[1, 2, 3],
['A', 'B', 'C'],
[223, 124, 434],
[4, 5, 6],
['D', 'E', 'F'],
[514, 821, 110]]
This is an example starting from 3 lists, the ones you got doing .tolist()
a = [1, 2, 3, 4, 5, 6]
b = ['A', 'B', 'C', 'D', 'E', 'F']
c = [223, 124, 434, 514, 821, 110]
res = []
for i in range(len(a) // 3):
    res.append(a[i * 3:(i * 3) + 3])
    res.append(b[i * 3:(i * 3) + 3])
    res.append(c[i * 3:(i * 3) + 3])
result is
[[1, 2, 3], ['A', 'B', 'C'], [223, 124, 434], [4, 5, 6], ['D', 'E', 'F'], [514, 821, 110]]
import pandas as pd
df=pd.DataFrame({
'Number':[1,2,3,4,5, 6],
'Name':['A', 'B', 'C', 'D', 'E', 'F'],
'Value': [223, 124, 434, 514, 821, 110]
})
# convert df into slices of 3
final_list = []
for i in range(0, len(df), 3):
    final_list.append(
        df.iloc[i:i+3]['Number'].to_list())
    final_list.append(
        df.iloc[i:i+3]['Name'].to_list())
    final_list.append(
        df.iloc[i:i+3]['Value'].to_list())
print(final_list)
output
[[1, 2, 3], ['A', 'B', 'C'], [223, 124, 434], [4, 5, 6], ['D', 'E', 'F'], [514, 821, 110]]
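I think you just want to divide each list (column) into sublists of size n.
You can change the value of n to change the sublist size. Note that this groups the chunks column by column, so the order differs from the expected output in the question.
col = df.columns
lst = []
n = 3
for c in col:
    temp = df[c].tolist()
    for i in range(0, len(temp), n):
        lst.append(temp[i:i+n])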
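The loop-based answers above can be generalised into a small helper that works for any number of columns and any chunk size. This is only a sketch: the function name column_chunks is hypothetical, and it assumes the rows should be taken in their existing order.

```python
import pandas as pd

def column_chunks(df, n):
    """Return, for each block of n rows, one list per column (in column order)."""
    out = []
    for start in range(0, len(df), n):
        block = df.iloc[start:start + n]          # rows start .. start+n-1
        out.extend(block[col].tolist() for col in df.columns)
    return out

df = pd.DataFrame({'Number': [1, 2, 3, 4, 5, 6],
                   'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Value': [223, 124, 434, 514, 821, 110]})

nested = column_chunks(df, 3)
print(nested)
```

If the number of rows is not a multiple of n, the last block simply yields shorter lists.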

Compare custom column values and print difference in pandas DataFrame

I have two DataFrames:
df1 = {'MS': [1000, 1005, 1007, NaN, 1010, 1012, 1020],
'Command': ['RD', 'RD', 'WR', '---', 'RD', 'RD', 'WR'],
'Data1': [100, 110, 120, NaN, 130, 140, 150],
'Data2': ['A', 'A', 'B', '--', 'A', 'B', 'B'],
'Data3': [1, 0, 0, NaN, 1, 1, 0]}
df2 = {'MS': [1001, 1006, 1010, NaN, 1003, 1015, 1020, 1030],
'Command': ['WR', 'RD', 'WR', '---', 'RD', 'RD', 'WR', 'RD'],
'Data1': [120, 110, 120, NaN, 140, 130, 150, 110],
'Data2': ['B', 'A', 'B', '--', 'B', 'A', 'B', 'A'],
'Data3': [0, 0, 1, NaN, 1, 0, 0, 0]}
I want to compare every row in 'df1' with 'df2' on all columns except 'MS', where 'MS' is the time in milliseconds. Both DataFrames have identical columns. The 'MS' column might contain NaN, in which case the row needs to be ignored.
By comparing, I want to:
Print matching rows in 'df1' and 'df2', one below the other, with a new column 'Diff' holding the difference between their 'MS' values. In the example above, row 3 in 'df1' matches row 1 of 'df2', so print:
     MS  Diff Command  Data1 Data2  Data3
0  1007   NaN      WR    120     B      0
1  1001     6      WR    120     B      0
Print all unmatched rows in df1 and df2
The compare function should be generic enough to accept a list of columns and use only the values in those columns to decide match or no match. For example, on every iteration I may pass a different column list:
itr1_comp_col = ['Command', 'Data1', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']
For each iteration, it should compare only the columns chosen by the user.
So far I have not been able to produce any satisfactory code. I am a beginner with pandas, and this is what I have tried:
I have tried grouping them by the 'Command' column and concatenating two identical groups by dropping duplicates, as discussed in this thread.
I have manually looped through the values in every row and compared them, which is far too slow because the data is huge (several million entries).
Please suggest an efficient way to handle the above case. Thanks in advance.
I will answer my own question, following what @Ankur said in his comments.
Even though this doesn't print matching rows one below the other, it partially fulfils the requirement.
Referring to this page, merge can be used to find the differences between DataFrames; in particular, the how= argument does the work. Below is the function:
def find_diff(df1: pd.DataFrame, df2: pd.DataFrame, col_itr):
    res = pd.merge(df1, df2, on=col_itr, how='outer')
    res['Diff'] = res['MS_x'] - res['MS_y']
    print(res)
Usage:
import pandas as pd
import numpy as np
d1 = {'MS': [1000, 1005, 1007, np.nan, 1010, 1012, 1020],
'Command': ['RD', 'RD', 'WR', '-', 'RD', 'RD', 'WR'],
'Data1': [100, 110, 120, np.nan, 130, 140, 150],
'Data2': ['A', 'A', 'B', '-', 'A', 'B', 'B'],
'Data3': [1, 0, 0, np.nan, 1, 1, 0]}
d2 = {'MS': [1001, 1006, 1010, np.nan, 1003, 1015, 1020, 1030],
'Command': ['WR', 'RD', 'WR', '-', 'RD', 'RD', 'WR', 'RD'],
'Data1': [120, 110, 120, np.nan, 140, 130, 150, 110],
'Data2': ['B', 'A', 'B', '-', 'B', 'A', 'B', 'A'],
'Data3': [0, 0, 1, np.nan, 1, 0, 0, 0]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
itr1_comp_col = ['Command', 'Data1', 'Data2', 'Data3']
itr2_comp_col = ['Command', 'Data2', 'Data3']
find_diff(df1, df2, itr1_comp_col)
find_diff(df1, df2, itr2_comp_col)
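To also separate matched from unmatched rows, one option is the indicator=True argument of pd.merge, which adds a '_merge' column with the values 'both', 'left_only' and 'right_only'. The sketch below is not part of the answer above: the function name split_matches and the small sample frames are hypothetical, and dropping rows whose 'MS' is NaN is only one reading of "need to be ignored".

```python
import numpy as np
import pandas as pd

def split_matches(df1, df2, cols):
    """Return (matched, only_in_df1, only_in_df2), comparing on `cols` only."""
    # Assumption: rows without a timestamp are excluded from the comparison.
    left = df1.dropna(subset=['MS'])
    right = df2.dropna(subset=['MS'])
    merged = pd.merge(left, right, on=cols, how='outer',
                      indicator=True, suffixes=('_1', '_2'))
    matched = merged[merged['_merge'] == 'both'].copy()
    matched['Diff'] = matched['MS_1'] - matched['MS_2']
    only_1 = merged[merged['_merge'] == 'left_only']
    only_2 = merged[merged['_merge'] == 'right_only']
    return matched, only_1, only_2

# Small illustrative frames (not the full data from the question).
a = pd.DataFrame({'MS': [1000, 1007, np.nan],
                  'Command': ['RD', 'WR', '-'],
                  'Data1': [100, 120, 0]})
b = pd.DataFrame({'MS': [1001, 1006],
                  'Command': ['WR', 'RD'],
                  'Data1': [120, 110]})

matched, only_a, only_b = split_matches(a, b, ['Command', 'Data1'])
print(matched[['MS_1', 'MS_2', 'Diff', 'Command', 'Data1']])
```

Because the merge is vectorised, this avoids the row-by-row loop mentioned in the question; if several rows share the same key values, every pairing of them appears in the matched frame.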

Get the row index of each extracted character from csv file

I have a column in my csv file (the second column, which I call character_position) that holds a list of characters and their positions, as follows.
Each line of this column contains one such character_position list. Overall I have 300 lines in this column, each with a list of character positions.
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
Each character has four values: left, top, right, bottom. For instance, character '1' has left=1890, top=1904, right=486, bottom=505.
I read the csv file as follows:
df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position'])
From this file I created a new csv file with five columns:
column 1: character, column 2: left, column 3: top, column 4: right, column 5: bottom.
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
I want to add two other columns, called line_number and all_chars_in_same_row:
1) line_number corresponds to the line from which, for example, 'm' 38 104 2456 2492 was extracted, let's say line 2.
2) all_chars_in_same_row corresponds to all the (space-separated) characters that are in the same row. For instance, for the first list of the character_position example shown above, I get '1' '8' '4' '1' '7' and so on.
More formally, all_chars_in_same_row means: write all the characters of the row given in the line_number column.
char left top right bottom line_number all_chars_in_same_row
0 'm' 38 104 2456 2492 from line 2 'm' '2' '5' 'g'
1 'i' 40 102 2442 222 from line 4
2 '.' 203 213 191 198 from line 6
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
EDIT1:
import pandas as pd
df_data=pd.read_csv('/home/ahmed/internship/cnn_ocr/list_characters.csv')
df_data.shape
(50, 3)
df_data.iloc[:, 1]
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11 [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12 [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13 [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14 [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15 [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16 [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17 [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18 [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19 [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20 [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21 [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22 [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23 [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24 [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25 [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26 [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27 [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28 [['S', 2369, 2382, 1833, 1866]]
29 [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
30 [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
31 [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
32 [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
33 [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
34 [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
35 [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
36 [['c', 88, 118, 2872, 2902]]
37 [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
38 [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
39 [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
40 [[]]
41 [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
42 [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
43 [[]]
44 [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
45 [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
46 [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
47 [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
48 [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
49 [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
Name: character_position, dtype: object
Then, to create chars.csv, I do the following:
df = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace([r'\[', r'\]'], ['', ''], regex=True)
df1['left'] = df1['left'].replace([r'\[', r'\]'], ['', ''], regex=True)
df1['top'] = df1['top'].replace([r'\[', r'\]'], ['', ''], regex=True)
df1['right'] = df1['right'].replace([r'\[', r'\]'], ['', ''], regex=True)
df1['bottom'] = df1['bottom'].replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('chars.csv')
However, I don't see in your response how you added the columns from_line and all_char_in_same_rows.
When I execute your line of code:
df_data = df_data.character_position.str.strip('[]').str.split(',', expand=True)
I get the following:
df_data[0:10]
0 1 2 3 4 5 6 7 8 9 ... \
0 'm' 38 104 2456 2492 'i' 40 102 2442 2448 ...
1 '.' 203 213 191 198 '3' 235 262 131 198 ...
2 'A' 275 347 147 239 'M' 363 465 145 239 ...
3 'A' 73 91 373 394 'D' 93 112 373 396 ...
4 'D' 454 473 663 685 'O' 474 495 664 687 ...
5 'A' 108 129 727 751 'V' 129 150 727 753 ...
6 'N' 34 51 949 970 '/' 52 61 948 970 ...
7 'S' 1368 1401 43 85 'A' 1406 1446 43 85 ...
8 'S' 1437 1457 112 138 'o' 1458 1476 118 138 ...
9 'h' 1686 1703 315 339 't' 1706 1715 316 339 ...
1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
0 None None None None None None None None None None
1 None None None None None None None None None None
2 None None None None None None None None None None
3 None None None None None None None None None None
4 None None None None None None None None None None
5 None None None None None None None None None None
6 None None None None None None None None None None
Here are the first 10 lines of my csv file:
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 687, 'C', 498, 516, 664, 687, 'U', 519, 536, 663, 687, 'M', 540, 561, 663, 687, 'E', 564, 581, 663, 685, 'N', 584, 600, 664, 685, 'T', 603, 618, 663, 685]]
5 [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 753, 'O', 153, 175, 727, 753, 'I', 178, 183, 727, 751, 'R', 187, 210, 727, 751, 'S', 220, 240, 727, 753, 'U', 243, 263, 727, 753, 'R', 267, 288, 727, 751, 'F', 302, 318, 727, 751, 'A', 320, 341, 727, 751, 'C', 342, 363, 726, 751, 'T', 366, 384, 726, 750, 'U', 387, 407, 727, 751, 'R', 411, 432, 727, 751, 'E', 435, 453, 726, 751, 'P', 797, 815, 727, 751, 'A', 818, 839, 727, 751, 'G', 840, 863, 727, 751, 'E', 867, 885, 726, 751, '1', 900, 911, 727, 751, '1', 926, 934, 727, 751, '1', 947, 956, 727, 751, '5', 962, 979, 727, 751], ['R', 120, 142, 778, 807, 'T', 144, 165, 778, 805, 'T', 178, 199, 778, 805, 'e', 201, 219, 786, 807, 'c', 222, 240, 786, 807, 'h', 241, 258, 778, 807, 'n', 263, 279, 786, 807, 'i', 284, 287, 778, 805, 'c', 291, 308, 786, 807, 'a', 309, 327, 786, 807, 'R', 350, 374, 778, 807, 'e', 377, 395, 786, 807, 't', 396, 405, 780, 805, 'u', 408, 425, 786, 807, 'r', 429, 440, 786, 807, 'n', 441, 458, 786, 807, '-', 471, 482, 793, 798, 'D', 497, 518, 778, 807, 'O', 522, 548, 777, 807, 'A', 549, 573, 778, 807, '/', 585, 596, 778, 807, 'D', 606, 630, 778, 807, 'A', 632, 656, 778, 807, 'P', 659, 680, 778, 805]]
6 [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970, 'C', 63, 81, 948, 970, 'O', 84, 103, 948, 970, 'M', 106, 127, 949, 970, 'M', 130, 151, 948, 970, 'A', 153, 172, 949, 970, 'N', 175, 192, 949, 970, 'D', 195, 213, 948, 970, 'E', 217, 232, 948, 970], ['1', 73, 84, 993, 1020, '1', 94, 105, 993, 1020, '8', 112, 130, 991, 1020, '4', 135, 153, 993, 1018, '5', 156, 172, 994, 1018, '7', 175, 192, 993, 1018, '6', 195, 213, 993, 1020, '0', 216, 235, 991, 1020, '6', 238, 257, 993, 1020, '5', 260, 278, 993, 1020, '0', 407, 425, 991, 1020, '9', 428, 446, 991, 1020, '.', 450, 455, 1015, 1020, '0', 459, 477, 991, 1020, '1', 485, 494, 994, 1018, '.', 503, 507, 1015, 1020, '2', 512, 530, 991, 1020, '0', 533, 551, 991, 1020, '1', 555, 566, 993, 1020, '5', 575, 593, 993, 1020, 'R', 632, 656, 991, 1020, 'M', 659, 684, 991, 1020, 'A', 689, 713, 991, 1020, 'N', 726, 747, 993, 1020, 'o', 752, 770, 999, 1020, '.', 774, 779, 1015, 1020, '5', 794, 812, 993, 1020, '8', 815, 833, 991, 1020, '4', 834, 852, 993, 1017, '4', 857, 873, 994, 1018, '3', 878, 896, 991, 1020, '8', 899, 917, 991, 1020, '0', 920, 938, 991, 1020, '/', 950, 960, 991, 1020, '0', 971, 990, 993, 1020, '7', 995, 1011, 993, 1018, '1', 1016, 1026, 993, 1018, '6', 1034, 1052, 993, 1020, '7', 1055, 1073, 993, 1020, '4', 1076, 1094, 993, 1018, '8', 1098, 1116, 991, 1020, '9', 1119, 1137, 991, 1020, '0', 1140, 1158, 993, 1020, '9', 1160, 1178, 991, 1020], ['N', 34, 51, 1045, 1066, '/', 54, 61, 1045, 1066, 'B', 63, 79, 1044, 1066, 'O', 82, 102, 1044, 1066, 'N', 105, 121, 1045, 1066, 'D', 133, 151, 1045, 1066, 'E', 156, 172, 1044, 1066, 'L', 183, 196, 1045, 1066, 'I', 199, 204, 1045, 1066, 'V', 205, 223, 1045, 1066, 'R', 226, 244, 1045, 1066, 'A', 246, 266, 1045, 1066, 'I', 267, 272, 1045, 1066, 'S', 275, 291, 1044, 1066, 'O', 294, 314, 1045, 1066, 'N', 318, 335, 1045, 1066], ['8', 72, 90, 1093, 1122, '2', 93, 109, 1093, 1122, '5', 114, 132, 1095, 1122, '9', 135, 153, 1093, 1122, '7', 154, 172, 1095, 1122, '1', 178, 189, 1093, 1122, 
'3', 196, 214, 1093, 1122, '1', 220, 231, 1095, 1122, '0', 238, 257, 1093, 1122, '3', 260, 278, 1093, 1122, '0', 407, 425, 1093, 1122, '6', 429, 447, 1095, 1122, '.', 452, 455, 1117, 1122, '0', 459, 477, 1093, 1122, '2', 480, 498, 1093, 1122, '.', 503, 507, 1117, 1122, '2', 512, 530, 1093, 1122, '0', 533, 551, 1093, 1122, '1', 557, 567, 1095, 1122, '5', 575, 593, 1095, 1122], ['v', 70, 90, 1150, 1171, '/', 88, 97, 1150, 1171, 'r', 100, 118, 1150, 1171, 'é', 121, 136, 1144, 1173, 'f', 141, 156, 1150, 1171, 'ê', 159, 174, 1144, 1173, 'r', 177, 195, 1150, 1173, 'e', 198, 214, 1150, 1171, 'n', 217, 234, 1150, 1171, 'c', 238, 257, 1149, 1171, 'e', 260, 276, 1149, 1173, 'B', 476, 497, 1152, 1179, 'O', 501, 527, 1149, 1179, 'G', 530, 555, 1150, 1180, 'D', 560, 582, 1152, 1179, 'O', 585, 611, 1149, 1179, 'A', 614, 638, 1150, 1179, '1', 642, 653, 1152, 1179, '5', 659, 677, 1153, 1180, 'B', 681, 701, 1152, 1179, 'T', 705, 726, 1152, 1179, '0', 728, 746, 1152, 1179, '6', 749, 767, 1152, 1179]]
7 [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43, 85, 'M', 1451, 1491, 36, 85, 'S', 1500, 1533, 43, 85, 'U', 1539, 1574, 43, 85, 'N', 1581, 1616, 43, 85, 'G', 1623, 1662, 42, 85, 'E', 1686, 1719, 43, 85, 'L', 1725, 1755, 43, 85, 'E', 1763, 1794, 42, 85, 'C', 1800, 1836, 43, 85, 'T', 1841, 1874, 42, 85, 'R', 1880, 1914, 42, 84, 'O', 1919, 1959, 42, 85, 'N', 1965, 1998, 42, 84, 'I', 2007, 2016, 42, 84, 'C', 2022, 2058, 42, 84, 'S', 2066, 2099, 42, 84, 'F', 2121, 2151, 42, 84, 'R', 2159, 2193, 42, 84, 'A', 2198, 2237, 40, 84, 'N', 2243, 2277, 40, 84, 'C', 2285, 2321, 42, 84, 'E', 2328, 2360, 40, 84]]
8 [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, 118, 138, 'c', 1479, 1493, 120, 138, 'i', 1494, 1499, 112, 136, 'é', 1503, 1518, 114, 138, 't', 1520, 1527, 115, 138, 'é', 1530, 1547, 112, 138, 'p', 1559, 1575, 120, 144, 'a', 1577, 1593, 118, 138, 'r', 1596, 1607, 118, 136, 'A', 1616, 1637, 112, 136, 'c', 1640, 1653, 118, 138, 't', 1655, 1664, 115, 136, 'i', 1665, 1670, 112, 136, 'o', 1673, 1688, 118, 138, 'n', 1692, 1707, 118, 136, 's', 1710, 1725, 118, 138, 'S', 1736, 1755, 112, 138, 'i', 1760, 1763, 112, 136, 'm', 1767, 1791, 118, 136, 'p', 1794, 1811, 118, 142, 'l', 1812, 1817, 112, 136, 'i', 1821, 1824, 112, 136, 'f', 1827, 1835, 112, 136, 'i', 1835, 1841, 112, 136, 'é', 1845, 1860, 112, 136, 'e', 1863, 1878, 118, 136, 'a', 1890, 1907, 118, 138, 'u', 1910, 1925, 118, 136, 'C', 1937, 1958, 112, 136, 'a', 1961, 1977, 118, 136, 'p', 1980, 1995, 118, 142, 'i', 1998, 2003, 112, 136, 't', 2006, 2013, 114, 136, 'a', 2015, 2030, 118, 136, 'l', 2034, 2037, 112, 136, 'd', 2051, 2066, 111, 136, 'e', 2069, 2085, 117, 136, '2', 2097, 2112, 112, 136, '7', 2115, 2132, 111, 136, '.', 2136, 2139, 132, 136, '0', 2144, 2159, 111, 136, '0', 2162, 2178, 111, 136, '0', 2180, 2196, 111, 136, '.', 2201, 2205, 132, 135, '0', 2208, 2225, 111, 136, '0', 2228, 2243, 111, 136, '0', 2246, 2261, 111, 136, 't', 2273, 2281, 112, 135, 'i', 2281, 2291, 111, 136], ['1', 1473, 1482, 153, 177, ',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 177, 'u', 1520, 1535, 160, 177, 'e', 1538, 1554, 159, 177, 'F', 1566, 1583, 153, 177, 'r', 1587, 1596, 159, 177, 'u', 1598, 1613, 159, 177, 'c', 1617, 1631, 159, 177, 't', 1634, 1641, 154, 177, 'i', 1643, 1646, 153, 177, 'd', 1650, 1665, 151, 177, 'o', 1668, 1685, 159, 177, 'r', 1688, 1697, 159, 177, 'C', 1709, 1730, 153, 177, 'S', 1733, 1751, 153, 177, '2', 1764, 1779, 153, 177, '0', 1781, 1797, 153, 177, '0', 1800, 1817, 153, 177, '3', 1820, 1835, 151, 177, '9', 1847, 1863, 151, 177, '3', 1866, 1883, 151, 177, '4', 1883, 1901, 153, 175, '8', 1904, 1919, 151, 
177, '4', 1919, 1937, 153, 175, 'S', 1950, 1968, 151, 177, 'A', 1971, 1992, 151, 175, 'I', 1995, 2000, 151, 175, 'N', 2004, 2024, 151, 175, 'T', 2027, 2046, 151, 175, 'O', 2058, 2081, 151, 177, 'U', 2085, 2105, 151, 177, 'E', 2109, 2127, 151, 177, 'N', 2130, 2150, 151, 175, 'C', 2163, 2186, 151, 175, 'e', 2187, 2204, 157, 175, 'd', 2207, 2222, 150, 175, 'e', 2225, 2240, 157, 175, 'x', 2243, 2258, 157, 175], ['T', 1638, 1656, 192, 216, 'É', 1659, 1677, 186, 217, 'L', 1682, 1697, 193, 217, 'É', 1701, 1719, 187, 217, 'P', 1722, 1742, 192, 217, 'H', 1746, 1766, 193, 217, 'O', 1770, 1793, 192, 217, 'N', 1796, 1815, 192, 216, 'E', 1820, 1838, 192, 217, '0', 1869, 1886, 190, 216, '1', 1890, 1899, 192, 216, '4', 1914, 1931, 193, 216, '4', 1934, 1950, 193, 216, '0', 1961, 1977, 190, 216, '4', 1980, 1997, 193, 216, '7', 2009, 2024, 192, 216, '0', 2027, 2042, 192, 216, '0', 2055, 2070, 192, 216, '0', 2073, 2090, 192, 216], ['R', 1517, 1538, 232, 258, '.', 1542, 1545, 253, 256, 'C', 1550, 1571, 232, 256, '.', 1575, 1580, 252, 256, 'S', 1584, 1602, 232, 256, '.', 1607, 1611, 252, 256, 'B', 1625, 1643, 232, 256, 'O', 1649, 1670, 231, 258, 'B', 1674, 1692, 232, 256, 'I', 1697, 1701, 232, 256, 'G', 1706, 1728, 232, 256, 'N', 1731, 1751, 232, 256, 'Y', 1754, 1775, 232, 256, 'B', 1788, 1806, 232, 256, '3', 1818, 1835, 231, 256, '3', 1838, 1855, 231, 256, '4', 1855, 1872, 232, 255, '3', 1884, 1899, 232, 256, '6', 1904, 1919, 232, 256, '7', 1922, 1937, 232, 256, '4', 1947, 1964, 232, 256, '9', 1967, 1983, 232, 256, '7', 1986, 2001, 232, 256, '-', 2013, 2022, 244, 249, 'A', 2034, 2055, 231, 255, 'P', 2057, 2075, 231, 255, 'E', 2079, 2097, 231, 256, '4', 2109, 2126, 232, 255, '6', 2129, 2145, 232, 256, '5', 2148, 2163, 232, 256, '2', 2166, 2183, 232, 255, 'Z', 2193, 2211, 231, 255], ['C', 1628, 1647, 271, 297, 'o', 1652, 1670, 279, 297, 'd', 1671, 1689, 273, 297, 'e', 1692, 1709, 279, 298, 'T', 1721, 1739, 273, 297, 'V', 1742, 1763, 273, 297, 'A', 1763, 1787, 273, 297, 'F', 1818, 1835, 
273, 297, 'R', 1839, 1859, 273, 297, '8', 1872, 1889, 273, 297, '9', 1890, 1905, 273, 297, '3', 1919, 1932, 273, 297, '3', 1937, 1952, 273, 297, '4', 1953, 1971, 273, 297, '3', 1983, 1998, 273, 297, '6', 2001, 2018, 273, 297, '7', 2021, 2036, 273, 295, '4', 2048, 2064, 274, 297, '9', 2066, 2082, 273, 297, '7', 2085, 2100, 273, 295]]
9 [['h', 1686, 1703, 315, 339, 't', 1706, 1715, 316, 339, 't', 1718, 1727, 316, 339, 'p', 1730, 1748, 321, 345, 'i', 1751, 1757, 321, 339, 'f', 1760, 1769, 315, 339, '/', 1769, 1776, 313, 339, 'w', 1779, 1804, 321, 337, 'w', 1804, 1829, 321, 339, 'w', 1830, 1854, 321, 337, '.', 1859, 1863, 333, 337, 's', 1868, 1883, 319, 339, 'a', 1886, 1901, 321, 337, 'm', 1905, 1929, 321, 337, 's', 1932, 1949, 321, 339, 'u', 1953, 1968, 321, 339, 'n', 1973, 1989, 321, 339, 'g', 1992, 2010, 319, 345, '.', 2015, 2019, 333, 337, 'f', 2021, 2033, 313, 337, 'r', 2034, 2045, 319, 337]]
10 [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, 370, 393, 'O', 1382, 1403, 370, 393, 'M', 1404, 1425, 370, 391, 'P', 1430, 1446, 370, 391, 'T', 1448, 1464, 370, 391, 'E', 1467, 1484, 370, 393, 'C', 1494, 1512, 370, 393, 'L', 1515, 1532, 370, 393, 'I', 1533, 1539, 370, 393, 'E', 1542, 1559, 370, 393, 'N', 1560, 1580, 370, 393, 'T', 1580, 1598, 370, 393]]
Here is the second CSV file:
char left right top bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 2448
2 'i' 40 100 2402 2410
3 'l' 40 102 2372 2382
4 'm' 40 102 2312 2358
5 'u' 40 102 2292 2310
6 'i' 40 104 2210 2260
7 'l' 40 104 2180 2208
8 'i' 40 104 2140 2166
**EDIT1:**
Here is my output for Solution 2 (input: `character_position`, as described above):
1831 1830 level_2 char left top right bottom FromLine all_chars_in_same_row
0 0 character_position 0 character_position 0 character_position
1 1 'm','i','i','l','m','u','i','l','i','l' 0 'm' 38 104 2456 2492 1 'm','i','i','l','m','u','i','l','i','l'
2 1 'm','i','i','l','m','u','i','l','i','l' 1 'i' 40 102 2442 2448 1 'm','i','i','l','m','u','i','l','i','l'
3 1 'm','i','i','l','m','u','i','l','i','l' 2 'i' 40 100 2402 2410 1 'm','i','i','l','m','u','i','l','i','l'
I think the problem comes from the fact that my data contains entries like:
[[',' , 'A', ',' , '.', ':' , ';', '1'], [], ['m', 'a',]]
So the empty `[]` causes a problem with the column order. I noticed this when I tried to remove all the empty `[]`, because my CSV then contains values such as:
in char: `['a'` rather than `'a'`; for numeric values: `8794]` rather than `8794`, or
`[5345` rather than `5345`.
So I processed the CSV as follows:
df = pd.read_csv(filepath_or_buffer='lit_charaters.csv', header=None, usecols=[1,3], names=['character_position','LineIndex'])
df = df.replace([r'\[', r'\]'], ['', ''], regex=True)
cols = ['char','left','right','top','bottom']
# the column was named character_position in read_csv above
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
# strip any brackets left over in the split fields (all five columns at once)
df1 = df1.replace([r'\[', r'\]'], ['', ''], regex=True)
df1.to_csv('char.csv')
Then I noticed the following:
Look at line 1221: column B is empty. The `[]` was replaced by an empty string, so columns B and C end up switched because of the empty char. How can I solve that?
I also have empty lines:
3831 '6' 296 314 3204 3231
3832
3833 '1' 480 492 3202 3229
Line 3832 should be removed.
in order to get something like this
**EDIT2:**
To solve the problem of empty rows and `[]` in list_characters.csv, such as
[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
and
[[]] [[]]
I did the following:
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
df1 = df1.dropna()
Then:
import ast
from itertools import chain
import numpy as np
import pandas as pd
df = pd.read_csv('character_position.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
page_number positionlrtb \
0 1841729699_001 [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...
1 1841729699_001 [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]
2 1841729699_001 [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...
3 1841729699_001 [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...
4 1841729699_001 [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...
LineIndex
0 [[mi, il, mu, il, il]]
1 [[.3]]
2 [[amsun]]
3 [[adresse, de, livraison]]
4 [[document]]
cols = ['char','left','top','right','bottom']
df1 = pd.DataFrame({
"a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
"b": list(chain.from_iterable(df.positionlrtb))})
df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
cols = ['char','left','top','right','bottom']
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 m 38 104 2456 2492
1 i 40 102 2442 2448
2 i 40 100 2402 2410
3 l 40 102 2372 2382
4 m 40 102 2312 2358
5 u 40 102 2292 2310
6 i 40 104 2210 2260
7 l 40 104 2180 2208
8 i 40 104 2140 2166
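The filtering step above can be tried on its own. This is a minimal sketch, assuming each cell is the string form of a list of lists like the sample from list_characters.csv shown earlier; the helper name `drop_empty` is made up for illustration.

```python
import ast

def drop_empty(cell):
    """Parse a stringified list of lists and drop its empty sublists."""
    parsed = ast.literal_eval(cell)
    return [sub for sub in parsed if len(sub) > 0]

# Sample cell taken from the question (note the empty [] in the middle).
cell = "[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]"
cleaned = drop_empty(cell)
print(cleaned)

# A cell holding only [[]] collapses to an empty list, so its row can be dropped.
print(drop_empty("[[]]"))
```

Rows whose cleaned list is empty can then be filtered out before flattening, which should avoid the blank lines in the output.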
However, this call:
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
returns None values.
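The None values are most likely padding rather than a parsing failure: with `expand=True`, every row is widened to the length of the longest line, and shorter lines are filled with missing values. A small sketch with made-up strings (not the real file) shows the effect:

```python
import pandas as pd

# Hypothetical data: the first line has one character, the second has two.
s = pd.Series([
    "[['a', 1, 2, 3, 4]]",
    "[['b', 5, 6, 7, 8, 'c', 9, 10, 11, 12]]",
])

parts = s.str.strip("[]").str.split(", ", expand=True)
print(parts.shape)  # both rows are as wide as the longest line

# The padding cells of the short row are missing values and can be dropped.
first_line = parts.iloc[0].dropna().tolist()
print(first_line)
```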
Once you have created the required DataFrame, don't drop the index after stacking: it holds your line number. Since this is a MultiIndex, take its first level, which is your line number.
df_data['LineIndex'] = df_data.index.get_level_values(0)
Then you can group by this line-number column and collect all characters that share the same line number. The result is built as a dictionary. Convert this dictionary into a DataFrame, and finally merge it back into the actual data.
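The idea can be checked on a tiny frame first. This is only a sketch with made-up values standing in for the real file: two text lines, two characters each, already split into flat columns.

```python
import pandas as pd

# Hypothetical toy data: one row per text line, flattened as
# (char, left, top, right, bottom) repeated for each character.
wide = pd.DataFrame([
    ["a", 1, 2, 3, 4, "b", 5, 6, 7, 8],
    ["c", 9, 10, 11, 12, "d", 13, 14, 15, 16],
])

# Level 0: position inside a group of 5; level 1: which character of the line.
wide.columns = [wide.columns % 5, wide.columns // 5]
long = wide.stack()  # index is now (line number, character number)

# Keep the line number as a regular column before renaming.
long["FromLine"] = long.index.get_level_values(0)
long.columns = ["char", "left", "top", "right", "bottom", "FromLine"]

# Line number -> all characters found on that line.
chars_by_line = {k: list(v) for k, v in long.groupby("FromLine")["char"]}
print(chars_by_line)
```

Because the line number survives as the first index level, no separate counter is needed to tell which line a character came from.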
Solution 1
import pandas as pd
df_data=pd.read_csv('list_characters.csv' , header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack() # don't remove the index, it holds the line each record came from
print(df_data)
df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
cols = ['char','left','top','right','bottom','FromLine']
df_data.columns = cols #assign the new column names
#create a new dictionary
#it contains the line number as key and all the characters from that line as value
DictChar= {k: list(v) for k,v in df_data.groupby("FromLine")["char"]}
#convert dictionary to a dataframe
df_chars=pd.DataFrame(list(DictChar.items()))
df_chars.columns = ['FromLine','char']
# Merge dataframes on column 'FromLine'
df_final=df_data.merge(df_chars,on ='FromLine')
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_final.columns=cols
print(df_final)
Solution 2
I personally prefer this solution over the first one. See inline comments for more details
import pandas as pd
df_data=pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
x=len(df_data.columns) #get total number of columns
#get all characters from every 5th column, concatenate and create new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda x: ','.join(x.dropna()), axis=1)
# get index of each row. This is the line number for your record
df_data[x+1]=df_data.index.get_level_values(0)
# now set line number and character columns as Index of data frame
df_data.set_index([x+1,x],inplace=True,drop=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1) #assign character values to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns=cols
df_data.reset_index(inplace=True) #remove the MultiIndex
print(df_data[cols])
Output
char left top right bottom from line all_chars_in_same_row
0 '.' 203 213 191 198 0 ['.', '3', 'C']
1 '3' 1758 1775 370 391 0 ['.', '3', 'C']
2 'C' 296 305 1492 1516 0 ['.', '3', 'C']
3 'A' 275 347 147 239 1 ['A', 'M', 'D']
4 'M' 2166 2184 370 391 1 ['A', 'M', 'D']
5 'D' 339 362 1815 1840 1 ['A', 'M', 'D']
6 'A' 73 91 373 394 2 ['A', 'D', 'A']
7 'D' 1395 1415 427 454 2 ['A', 'D', 'A']
8 'A' 1440 1455 2047 2073 2 ['A', 'D', 'A']
9 'D' 454 473 663 685 3 ['D', 'O', '0']
10 'O' 1533 1545 487 541 3 ['D', 'O', '0']
11 '0' 339 360 2137 2163 3 ['D', 'O', '0']
12 'A' 108 129 727 751 4 ['A', 'V', 'I']
13 'V' 1659 1677 490 514 4 ['A', 'V', 'I']
14 'I' 339 360 1860 1885 4 ['A', 'V', 'I']
15 'N' 34 51 949 970 5 ['N', '/', '2']
16 '/' 1890 1904 486 505 5 ['N', '/', '2']
17 '2' 1266 1283 1951 1977 5 ['N', '/', '2']
18 'S' 1368 1401 43 85 6 ['S', 'A', '8']
19 'A' 1344 1361 583 607 6 ['S', 'A', '8']
20 '8' 2207 2217 1492 1515 6 ['S', 'A', '8']
21 'S' 1437 1457 112 138 7 ['S', 'o', 'O']
22 'o' 1548 1580 979 1015 7 ['S', 'o', 'O']
23 'O' 1331 1349 370 391 7 ['S', 'o', 'O']
24 'h' 1686 1703 315 339 8 ['h', 't', 't']
25 't' 169 190 1291 1312 8 ['h', 't', 't']
26 't' 169 190 1291 1312 8 ['h', 't', 't']
27 'N' 1331 1349 370 391 9 ['N', 'C', 'C']
28 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
29 'C' 296 305 1492 1516 9 ['N', 'C', 'C']

Python dictionary comprehension with Pandas

I am trying to create a dictionary from two columns of a DataFrame (df)
mydict={x :y for x in df['Names'] for y in df['Births']}
But all of the values are the same (the last value in the column)!
{'Bob': 973, 'Jessica': 973, 'John': 973, 'Mary': 973, 'Mel': 973}
I checked the column and it has many other values, what am I doing wrong?
I think Abdou hit the nail on the head with dict(zip(df['Names'], df['Births'])), but if you want to do it with a dict comprehension you can do this:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(
...: [{'Births': 971, 'Names': 'Bob'},
...: {'Births': 972, 'Names': 'Jessica'},
...: {'Births': 973, 'Names': 'John'},
...: {'Births': 974, 'Names': 'Mary'},
...: {'Births': 975, 'Names': 'Mel'}])
In [3]: {d['Names']: d['Births'] for d in df.to_dict(orient='records')}
Out[3]: {'Bob': 971, 'Jessica': 972, 'John': 973, 'Mary': 974, 'Mel': 975}
Try:
my_dict = {row.Names: row.Births for (index, row) in df.iterrows()}
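As for why the original comprehension fails: two `for` clauses in one comprehension are nested loops, so every name is paired with every birth count in turn and the last one overwrites the rest. A short sketch with toy data (a three-row subset of the frame above) compares it with pairwise iteration:

```python
import pandas as pd

# Toy data, a subset of the example frame.
df = pd.DataFrame({
    "Names": ["Bob", "Jessica", "John"],
    "Births": [971, 972, 973],
})

# Nested loops: for each name, y runs through ALL births, and the last one wins.
nested = {x: y for x in df["Names"] for y in df["Births"]}
print(nested)

# Pairwise iteration: zip walks both columns in step.
paired = {x: y for x, y in zip(df["Names"], df["Births"])}
print(paired)
```

`dict(zip(df["Names"], df["Births"]))` builds the same mapping as `paired` without a comprehension.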
