Plotting select rows and columns of a DataFrame (Python)

I have searched around but have not found an exact way to do this. I have a data frame of several baseball teams and their statistics. Like the following:
RK TEAM          GP    AB    R    H   2B  3B   HR    TB  RBI   AVG   OBP   SLG   OPS
1  Milwaukee    163  5542  754 1398  252  24  218  2352  711  .252  .323  .424  .747
2  Chicago Cubs 163  5624  761 1453  286  34  167  2308  722  .258  .333  .410  .744
3  LA Dodgers   163  5572  804 1394  296  33  235  2461  756  .250  .333  .442  .774
4  Colorado     163  5541  780 1418  280  42  210  2412  748  .256  .322  .435  .757
5  Baltimore    162  5507  622 1317  242  15  188  2153  593  .239  .298  .391  .689
I want to be able to plot two teams on the X-axis and then perhaps 3 metrics (ex: R, H, TB) on the Y-axis with the two teams side by side in bar chart format. I haven't been able to figure out how to do this. Any ideas?
Thank you.

My approach was to create a new dataframe which contains only the columns you are interested in plotting:
import pandas as pd
import matplotlib.pyplot as plt

data = [[1, 'Milwaukee', 163, 5542, 754, 1398, 252, 24, 218, 2352, 711, .252, .323, .424, .747],
        [2, 'Chicago Cubs', 163, 5624, 761, 1453, 286, 34, 167, 2308, 722, .258, .333, .410, .744],
        [3, 'LA Dodgers', 163, 5572, 804, 1394, 296, 33, 235, 2461, 756, .250, .333, .442, .774],
        [4, 'Colorado', 163, 5541, 780, 1418, 280, 42, 210, 2412, 748, .256, .322, .435, .757],
        [5, 'Baltimore', 162, 5507, 622, 1317, 242, 15, 188, 2153, 593, .239, .298, .391, .689]]
df = pd.DataFrame(data, columns=['RK', 'TEAM', 'GP', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS'])

# keep only the columns of interest
dfplot = df[['TEAM', 'R', 'H', 'TB']].copy()

fig, ax = plt.subplots()
dfplot.plot(kind='bar', x='TEAM', ax=ax, width=0.4)
plt.show()
This creates the following output:

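To get the two-teams-side-by-side layout the question asks about, one can filter the rows with isin before plotting. A minimal sketch, assuming team names as in the sample data and a trimmed set of columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = [[1, 'Milwaukee', 754, 1398, 2352],
        [3, 'LA Dodgers', 804, 1394, 2461],
        [5, 'Baltimore', 622, 1317, 2153]]
df = pd.DataFrame(data, columns=['RK', 'TEAM', 'R', 'H', 'TB'])

# keep just the two teams to compare, then plot the three metrics
# as grouped bars, one group per team
two = df[df['TEAM'].isin(['Milwaukee', 'LA Dodgers'])]
ax = two.plot(kind='bar', x='TEAM', y=['R', 'H', 'TB'], rot=0)
plt.show()
```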

Is there a way to reshape a single index pandas DataFrame into a multi index to adapt to time series?

Here's a sample data frame:
import pandas as pd
sample_dframe = pd.DataFrame.from_dict(
    {
        "id": [123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456],
        "V1": [2552, 813, 496, 401, 4078, 952, 7279, 544, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4078952, 7279544],
        "V2": [3434, 133, 424, 491, 8217, 915, 7179, 5414, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4952, 4453],
        "V3": [382, 161, 7237, 7503, 561, 6801, 1072, 9660, 62107, 6233, 5403, 3745, 8613, 6302, 557, 4256, 9874, 3013, 9352, 4522, 3232, 58830],
        "V4": [32628, 4471, 4781, 1497, 45104, 8657, 81074, 1091, 370835, 2058, 4447, 7376, 302237, 6833, 48348, 3545, 4263, 642, 255, 2813, 4088920, 6323521]
    }
)
The data frame looks like this:
The above sample shape is (22, 5) and has columns id, V1..V4. I need to convert this into a multi index data frame (as a time series), where for a given id, I need to group 5 values (time steps) from each of V1..V4 for a given id.
i.e., it should give me a frame of shape (2, 4, 5) since there are 2 unique id values.
IIUC, you might just want:
sample_dframe.set_index('id').stack()
NB: the output is a Series; for a DataFrame, add .to_frame(name='col_name').
Output:
id
123 V1 2552
V2 3434
V3 382
V4 32628
V1 813
...
456 V4 4088920
V1 7279544
V2 4453
V3 58830
V4 6323521
Length: 88, dtype: int64
Or, maybe:
(sample_dframe
 .assign(time=lambda d: d.groupby('id').cumcount())
 .set_index(['id', 'time']).stack()
 .swaplevel('time', -1)
)
Output:
id time
123 V1 0 2552
V2 0 3434
V3 0 382
V4 0 32628
V1 1 813
...
456 V4 10 4088920
V1 11 7279544
V2 11 4453
V3 11 58830
V4 11 6323521
Length: 88, dtype: int64
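If the goal is literally a (2, 4, 5) array rather than a MultiIndex frame, one option is to keep the first five time steps per id and stack the per-id blocks with numpy. This is a sketch under that assumption (the question's sample actually has eleven steps per id, so a hypothetical toy frame is used here):

```python
import numpy as np
import pandas as pd

# toy frame in the same shape as the question: 6 time steps per id
df = pd.DataFrame({
    'id': [123] * 6 + [456] * 6,
    'V1': range(12), 'V2': range(12, 24),
    'V3': range(24, 36), 'V4': range(36, 48),
})

# first 5 rows (time steps) of each id, then one (4, 5) block per id
first5 = df.groupby('id').head(5)
arr = np.stack([g[['V1', 'V2', 'V3', 'V4']].to_numpy().T
                for _, g in first5.groupby('id')])
print(arr.shape)  # (2, 4, 5)
```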
import pandas as pd
df = pd.DataFrame.from_dict(
    {
        "id": [123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456, 456],
        "V1": [2552, 813, 496, 401, 4078, 952, 7279, 544, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4078952, 7279544],
        "V2": [3434, 133, 424, 491, 8217, 915, 7179, 5414, 450, 548, 433, 4696, 244, 9735, 4263, 642, 255, 2813, 496, 401, 4952, 4453],
        "V3": [382, 161, 7237, 7503, 561, 6801, 1072, 9660, 62107, 6233, 5403, 3745, 8613, 6302, 557, 4256, 9874, 3013, 9352, 4522, 3232, 58830],
        "V4": [32628, 4471, 4781, 1497, 45104, 8657, 81074, 1091, 370835, 2058, 4447, 7376, 302237, 6833, 48348, 3545, 4263, 642, 255, 2813, 4088920, 6323521]
    }
)
print(df)
"""
id V1 V2 V3 V4
0 123 2552 3434 382 32628
1 123 813 133 161 4471
2 123 496 424 7237 4781
3 123 401 491 7503 1497
4 123 4078 8217 561 45104
5 123 952 915 6801 8657
6 123 7279 7179 1072 81074
7 123 544 5414 9660 1091
8 123 450 450 62107 370835
9 123 548 548 6233 2058
10 456 433 433 5403 4447
11 456 4696 4696 3745 7376
12 456 244 244 8613 302237
13 456 9735 9735 6302 6833
14 456 4263 4263 557 48348
15 456 642 642 4256 3545
16 456 255 255 9874 4263
17 456 2813 2813 3013 642
18 456 496 496 9352 255
19 456 401 401 4522 2813
20 456 4078952 4952 3232 4088920
21 456 7279544 4453 58830 6323521
"""
df = df.set_index('id').stack().reset_index().drop(columns='level_1').rename(columns={0: 'V1_new'})
print(df)
"""
id V1_new
0 123 2552
1 123 3434
2 123 382
3 123 32628
4 123 813
.. ... ...
83 456 4088920
84 456 7279544
85 456 4453
86 456 58830
87 456 6323521
"""

How to build a histogram from a pandas dataframe where each observation is a list?

I have a dataframe as follows. The values in each cell are a list of elements. I want to visualize the distribution of the values from the "Values" column using histograms, stacked in rows OR separated by colours (Area_code).
How can I get the values and construct the histograms in plotly? Any other ideas are also welcome. Thank you.
Area_code Values
0 New_York [999, 54, 231, 43, 177, 313, 212, 279, 199, 267]
1 Dallas [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316]
2 XXX [560]
3 YYY [884, 13]
4 ZZZ [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]
If you reshape your data, this would be a perfect case for px.histogram. And from there you can opt between several outputs like sum, average and count through the histfunc argument:
fig = px.histogram(df, x='Area_code', y='Values', histfunc='sum')
fig.show()
You haven't specified what kind of output you're aiming for, but I'll leave it up to you to change the argument for histfunc and see which option suits your needs best.
I'm often inclined to urge users to rethink their entire data process, but I'm just going to assume that there are good reasons why you're stuck with what seems like a pretty weird setup in your dataframe. The snippet below contains a complete data munging process to reshape your data from your setup to a so-called long format:
Area_code Values
0 New_York 999
1 New_York 54
2 New_York 231
3 New_York 43
4 New_York 177
5 New_York 313
6 New_York 212
7 New_York 279
8 New_York 199
9 New_York 267
10 Dallas 915
11 Dallas 183
12 Dallas 2326
13 Dallas 316
14 Dallas 206
15 Dallas 31
16 Dallas 317
17 Dallas 26
18 Dallas 31
19 Dallas 56
20 Dallas 316
21 XXX 560
22 YYY 884
23 YYY 13
24 ZZZ 203
And this is a perfect format for many of the great functionalities of plotly.express.
Complete code:
import plotly.express as px
import pandas as pd

# data input
df = pd.DataFrame({'Area_code': {0: 'New_York', 1: 'Dallas', 2: 'XXX', 3: 'YYY', 4: 'ZZZ'},
                   'Values': {0: [999, 54, 231, 43, 177, 313, 212, 279, 199, 267],
                              1: [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316],
                              2: [560],
                              3: [884, 13],
                              4: [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]}})

# data munging: one row per (area, value) pair
areas = []
values = []
for i, row in df.iterrows():
    for val in row['Values']:
        areas.append(row['Area_code'])
        values.append(val)
df = pd.DataFrame({'Area_code': areas, 'Values': values})

# plotly
fig = px.histogram(df, x='Area_code', y='Values', histfunc='sum')
fig.show()
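For what it's worth, the same long format can be produced without the explicit loop via DataFrame.explode (available in pandas 0.25+). A sketch on a trimmed version of the data:

```python
import pandas as pd

df = pd.DataFrame({'Area_code': ['New_York', 'Dallas', 'XXX'],
                   'Values': [[999, 54, 231], [915, 183], [560]]})

# one row per (area, value) pair; explode unpacks each list vertically
long_df = df.explode('Values').reset_index(drop=True)
print(long_df)
```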

Save the data in txt file without brackets and comma (Python)

My source data is appended in the following array:
x, y, w, h = track_window
listing.append([x, y])
where
x = [300, 300, 300, 296, 291, 294, 299, 284, 303, 323, 334, 343, 354, 362, 362, 362, 360, 361, 351]
and
y = [214, 216, 214, 214, 216, 216, 215, 219, 217, 220, 218, 218, 222, 222, 222, 223, 225, 224, 222]
So the x values should be written to a text file, followed by the y values, without commas and brackets, in the following form: a single space between every two numbers and eight numbers per row.
x:
300 300 300 296 291 294 299 284
303 323 334 343 354 362 362 362
360 361 351
y:
214 216 214 214 216 216 215 219
217 220 218 218 222 222 222 223
225 224 222
How can I achieve this?
What I did:
import pickle

with open('text.txt', 'rb') as f:
    val = pickle.load(f)

a = [item[0] for item in val]  # x values
b = [item[1] for item in val]  # y values
print(a)
print(b)
Thank you
It would be good to see what you have tried. Nonetheless, this will work for you.
def write_in_chunks(f, lst, n):
    for i in range(0, len(lst), n):
        chunk = lst[i:i + n]
        f.write(" ".join(str(val) for val in chunk) + "\n")

x = [300, 300, 300, 296, 291, 294, 299, 284, 303, 323, 334, 343, 354, 362, 362, 362, 360, 361, 351]
y = [214, 216, 214, 214, 216, 216, 215, 219, 217, 220, 218, 218, 222, 222, 222, 223, 225, 224, 222]

with open("output.txt", "w") as f:
    write_in_chunks(f, x, 8)
    write_in_chunks(f, y, 8)
Creates output.txt containing
300 300 300 296 291 294 299 284
303 323 334 343 354 362 362 362
360 361 351
214 216 214 214 216 216 215 219
217 220 218 218 222 222 222 223
225 224 222
Adding extra blank lines in the output is left as an exercise for the reader... (hint: see where existing newlines are written).
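For completeness, one way to also emit the x:/y: headers and a separating blank line that the question's desired output shows (a sketch; the output file name is arbitrary):

```python
def write_in_chunks(f, lst, n):
    for i in range(0, len(lst), n):
        f.write(" ".join(str(val) for val in lst[i:i + n]) + "\n")

x = [300, 300, 300, 296, 291, 294, 299, 284, 303, 323, 334, 343, 354, 362, 362, 362, 360, 361, 351]
y = [214, 216, 214, 214, 216, 216, 215, 219, 217, 220, 218, 218, 222, 222, 222, 223, 225, 224, 222]

with open("output.txt", "w") as f:
    for label, data in (("x:", x), ("y:", y)):
        f.write(label + "\n")        # section header
        write_in_chunks(f, data, 8)  # 8 numbers per row
        f.write("\n")                # blank line between sections
```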

Changing a value of an observation conditional to the value of the next observation by group

I have a dataset of trips by person (trips_data). Each observation is a trip, with the start time of the trip (strttime), the end time of the trip (endtime) and the person who makes the trip (clepersonne). For some persons, the endtime of a trip is later than the strttime of the next trip. Here is an example, with times in hhmm format:
TRIPID clepersonne strttime endtime
90 100010413 10001041 1600 1614
91 100010414 10001041 1615 1648
92 100010415 10001041 1645 1726
93 100010416 10001041 1930 1954
94 100010621 10001062 900 921
95 100010622 10001062 1000 1013
The trip 100010414 terminates later than the strttime of the next trip 100010415 for the same person 10001041. I would like to correct this inconsistency by replacing the endtime of trip 100010414 with the start time of the next trip. For this example, the result that I want is:
TRIPID clepersonne strttime endtime
90 100010413 10001041 1600 1614
91 100010414 10001041 1615 *1645*
92 100010415 10001041 1645 1726
93 100010416 10001041 1930 1954
94 100010621 10001062 900 921
95 100010622 10001062 1000 1013
I have tried doing this :
trips_data = trips_data.sort_index()  # to iterate over each value
for i in range(0, len(trips_data.index)):
    trips_data['endtime'] = np.where((trips_data.strttime[i+1] < trips_data.endtime[i]) & (trips_data.clepersonne[i+1] == trips_data.clepersonne[i]), trips_data.strttime[i+1], trips_data['endtime'])
But I get this error:
Traceback (most recent call last):
File "C:\Users\Utilisateur\AppData\Roaming\Python\Python37\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-23-64092a4318fd>", line 3, in <module>
trips_data['endtime'] = np.where((trips_data.strttime[i+1]<trips_data.endtime[i]) & (trips_data.clepersonne[i+1] == trips_data.clepersonne[i]), trips_data.strttime[i+1], trips_data['endtime'] )
File "C:\Users\Utilisateur\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\series.py", line 1071, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\Utilisateur\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 122
Can you help me?
Thanks
Use:
next_start = df.groupby('clepersonne')['strttime'].shift(-1)
mask = df['endtime'].sub(next_start) > 0
df['endtime'] = df['endtime'].mask(mask, next_start)
print(df)
TRIPID clepersonne strttime endtime
90 100010413 10001041 1600 1614
91 100010414 10001041 1615 1645
92 100010415 10001041 1645 1726
93 100010416 10001041 1930 1954
94 100010621 10001062 900 921
95 100010622 10001062 1000 1013
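For reference, the groupby/shift approach runs end-to-end like this (a sketch; the frame is rebuilt from the question's sample table):

```python
import pandas as pd

df = pd.DataFrame({
    'TRIPID': [100010413, 100010414, 100010415, 100010416, 100010621, 100010622],
    'clepersonne': [10001041] * 4 + [10001062] * 2,
    'strttime': [1600, 1615, 1645, 1930, 900, 1000],
    'endtime': [1614, 1648, 1726, 1954, 921, 1013],
})

# start time of the following trip for the same person (NaN for the last trip)
next_start = df.groupby('clepersonne')['strttime'].shift(-1)
# overwrite endtime only where it overlaps the next trip's start
df['endtime'] = df['endtime'].mask(df['endtime'] > next_start, next_start)
print(df)
```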

Remove empty rows and empty [ ] using python

I have 10000 rows in my csv file. I want to remove the empty brackets [] and the rows which are entirely empty [[]], as depicted in the following picture:
For instance, the first cell in the first column:
[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
needs to be transformed into:
[['1', 2364, 2382, 1552, 1585],['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
and the row with only empty brackets:
[[]] [[]]
needs to be removed from the file. As a result we get:
I tried:
df1 = df.Column_1.str.strip([]).str.split(',', expand=True)
My data are of string type:
print(type(df.loc[0,'Column_1']))
<class 'str'>
print(type(df.loc[0,'Column_2']))
<class 'str'>
EDIT1
After executing the following code :
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
df1 = df1.dropna()
it solves the problem. However, I got some issues with the comma ',' (as a character, not a delimiter) in the resulting lines. I wanted to create a new csv file as follows:
columns =['char', 'left', 'right', 'top', 'down']
which corresponds, for instance, to:
'1' 2364 2382 1552 1585
to get a csv file as follows:
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
So the whole code to get this is:
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
df1 = df1.dropna()
cols = ['char','left','right','top','bottom']
df1 = df.positionlrtb.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
for col in cols:
    df1[col] = df1[col].replace(['\[', '\]'], ['', ''], regex=True)
However, after doing that I don't find any ',' in my file, which creates disorder in the new csv file. Rather than getting:
',' 1491 1494 172 181
I got no comma ',', and the disorder is shown in the following two lines:
' ' 1491 1494 172
181 'r' 1508 1517 159
It should be:
',' 1491 1494 172 181
'r' 1508 1517 159 ... and so on
EDIT2
I'm trying to add 2 other columns called line_number and all_chars_in_same_row.
1) line_number corresponds to the line from which, for example,
'm' 38 104 2456 2492
is extracted, let's say line 2.
2) all_chars_in_same_row corresponds to all the (spaced) characters which are in the same row. For instance, for
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
I get '1' '8' '4' '1' '7' and so on.
More formally, all_chars_in_same_row means: write all the characters of the row given by line_number.
char left top right bottom line_number all_chars_in_same_row
0 'm' 38 104 2456 2492 from line 2 'm' '2' '5' 'g'
1 'i' 40 102 2442 222 from line 4
2 '.' 203 213 191 198 from line 6
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
The code related to that is :
import pandas as pd
df_data = pd.read_csv('see2.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
x = len(df_data.columns)  # get total number of columns
# get all characters from every 5th column, concatenate and create new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda x: ','.join(x.dropna()), axis=1)
# get index of each row. This is the line number for your record
df_data[x+1] = df_data.index.get_level_values(0)
# now set line number and character columns as index of the data frame
df_data.set_index([x+1, x], inplace=True, drop=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1)  # assign character values to a column
cols = ['char', 'left', 'top', 'right', 'bottom', 'FromLine', 'all_chars_in_same_row']
df_data.columns = cols
df_data.reset_index(inplace=True)  # remove multiindexing
print(df_data[cols])
and output
char left top right bottom from line all_chars_in_same_row
0 '.' 203 213 191 198 0 ['.', '3', 'C']
1 '3' 1758 1775 370 391 0 ['.', '3', 'C']
2 'C' 296 305 1492 1516 0 ['.', '3', 'C']
3 'A' 275 347 147 239 1 ['A', 'M', 'D']
4 'M' 2166 2184 370 391 1 ['A', 'M', 'D']
5 'D' 339 362 1815 1840 1 ['A', 'M', 'D']
6 'A' 73 91 373 394 2 ['A', 'D', 'A']
7 'D' 1395 1415 427 454 2 ['A', 'D', 'A']
8 'A' 1440 1455 2047 2073 2 ['A', 'D', 'A']
9 'D' 454 473 663 685 3 ['D', 'O', '0']
10 'O' 1533 1545 487 541 3 ['D', 'O', '0']
11 '0' 339 360 2137 2163 3 ['D', 'O', '0']
12 'A' 108 129 727 751 4 ['A', 'V', 'I']
13 'V' 1659 1677 490 514 4 ['A', 'V', 'I']
14 'I' 339 360 1860 1885 4 ['A', 'V', 'I']
15 'N' 34 51 949 970 5 ['N', '/', '2']
16 '/' 1890 1904 486 505 5 ['N', '/', '2']
17 '2' 1266 1283 1951 1977 5 ['N', '/', '2']
18 'S' 1368 1401 43 85 6 ['S', 'A', '8']
19 'A' 1344 1361 583 607 6 ['S', 'A', '8']
20 '8' 2207 2217 1492 1515 6 ['S', 'A', '8']
21 'S' 1437 1457 112 138 7 ['S', 'o', 'O']
22 'o' 1548 1580 979 1015 7 ['S', 'o', 'O']
23 'O' 1331 1349 370 391 7 ['S', 'o', 'O']
24 'h' 1686 1703 315 339 8 ['h', 't', 't']
25 't' 169 190 1291 1312 8 ['h', 't', 't']
26 't' 169 190 1291 1312 8 ['h', 't', 't']
27 'N' 1331 1349 370 391 9 ['N', 'C', 'C']
28 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
29 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
However, I got strange results (order of letters, numbers, columns, headers...). I can't share them; the file is too long and exceeds the maximum number of characters. This line of code:
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
returns None values:
0 1 2 3 4 5 6 7 8 9 ... \
0 'm' 38 104 2456 2492 'i' 40 102 2442 2448 ...
1 '.' 203 213 191 198 '3' 235 262 131 198 ...
2 'A' 275 347 147 239 'M' 363 465 145 239 ...
3 'A' 73 91 373 394 'D' 93 112 373 396 ...
4 'D' 454 473 663 685 'O' 474 495 664 687 ...
5 'A' 108 129 727 751 'V' 129 150 727 753 ...
6 'N' 34 51 949 970 '/' 52 61 948 970 ...
7 'S' 1368 1401 43 85 'A' 1406 1446 43 85 ...
8 'S' 1437 1457 112 138 'o' 1458 1476 118 138 ...
9 'h' 1686 1703 315 339 't' 1706 1715 316 339 ...
1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
0 None None None None None None None None None None
1 None None None None None None None None None None
2 None None None None None None None None None None
3 None None None None None None None None None None
4 None None None None None None None None None None
5 None None None None None None None None None None
6 None None None None None None None None None None
EDIT3
However, when l add page_number along with character_position
df1 = pd.DataFrame({
    "from_line": np.repeat(df.index.values, df.character_position.str.len()),
    "b": list(chain.from_iterable(df.character_position)),
    "page_number": np.repeat(df.index.values, df['page_number'])
})
I got the following error:
File "/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
For lists you can use applymap with a list comprehension to remove the [] first, and then remove all rows with boolean indexing, where the mask checks that no value in the row has length 0 (an empty list).
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
If you need to remove a row if any value is [[]]:
df1 = df1[~(df1.applymap(len).eq(0)).any(1)]
If values are strings:
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
and then dropna:
df1 = df1.dropna(how='all')
Or:
df1 = df1.dropna()
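Putting the string case together end-to-end (a sketch; it assumes the cells hold the bracketed strings shown in the question and parses them with ast.literal_eval instead of regex replacement):

```python
import ast
import pandas as pd

df = pd.DataFrame({'Column_1': [
    "[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640]]",
    "[[]]",
]})

# parse each string cell into a real list of lists
parsed = df['Column_1'].apply(ast.literal_eval)
# drop empty inner lists, then drop rows where nothing is left
parsed = parsed.apply(lambda x: [y for y in x if len(y) > 0])
keep = parsed.str.len() > 0
df1 = df[keep].copy()
df1['Column_1'] = parsed[keep]
print(df1)
```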
EDIT1:
import ast

df = pd.read_csv('see2.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
page_number positionlrtb \
0 1841729699_001 [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...
1 1841729699_001 [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]
2 1841729699_001 [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...
3 1841729699_001 [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...
4 1841729699_001 [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...
LineIndex
0 [[mi, il, mu, il, il]]
1 [[.3]]
2 [[amsun]]
3 [[adresse, de, livraison]]
4 [[document]]
from itertools import chain
import numpy as np

df1 = pd.DataFrame({
    "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
    "b": list(chain.from_iterable(df.positionlrtb))})
df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
cols = ['char','left','top','right','bottom']
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 m 38 104 2456 2492
1 i 40 102 2442 2448
2 i 40 100 2402 2410
3 l 40 102 2372 2382
4 m 40 102 2312 2358
5 u 40 102 2292 2310
6 i 40 104 2210 2260
7 l 40 104 2180 2208
8 i 40 104 2140 2166
EDIT2:
#skip first row
df = pd.read_csv('see2.csv', usecols=[2], names=['character_position'], skiprows=1)
print (df.head())
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
#convert to list, remove empty lists
df.character_position = df.character_position.apply(ast.literal_eval)
df.character_position = df.character_position.apply(lambda x: [y for y in x if len(y) > 0])
#new df - http://stackoverflow.com/a/42788093/2901002
df1 = pd.DataFrame({
    "from line": np.repeat(df.index.values, df.character_position.str.len()),
    "b": list(chain.from_iterable(df.character_position))})
#filter strings only by list comprehension, convert to tuple, because we need to create an index
df1['all_chars_in_same_row'] = df1['b'].apply(lambda x: tuple([y for y in x if isinstance(y, str)]))
df1 = df1.set_index(['from line','all_chars_in_same_row'])
#new df from column b
df1 = pd.DataFrame(df1.b.values.tolist(), index=df1.index)
#Multiindex in columns
df1.columns = [df1.columns % 5, df1.columns // 5]
#reshape
df1 = df1.stack().reset_index(level=2, drop=True)
cols = ['char','left','top','right','bottom']
df1.columns = cols
#convert last columns to int
df1[cols[1:]] = df1[cols[1:]].astype(int)
df1 = df1.reset_index()
#convert tuples to list
df1['all_chars_in_same_row'] = df1['all_chars_in_same_row'].apply(list)
print (df1.head(15))
from line all_chars_in_same_row char left top right bottom
0 0 [m, i, i, l, m, u, i, l, i, l] m 38 104 2456 2492
1 0 [m, i, i, l, m, u, i, l, i, l] i 40 102 2442 2448
2 0 [m, i, i, l, m, u, i, l, i, l] i 40 100 2402 2410
3 0 [m, i, i, l, m, u, i, l, i, l] l 40 102 2372 2382
4 0 [m, i, i, l, m, u, i, l, i, l] m 40 102 2312 2358
5 0 [m, i, i, l, m, u, i, l, i, l] u 40 102 2292 2310
6 0 [m, i, i, l, m, u, i, l, i, l] i 40 104 2210 2260
7 0 [m, i, i, l, m, u, i, l, i, l] l 40 104 2180 2208
8 0 [m, i, i, l, m, u, i, l, i, l] i 40 104 2140 2166
9 0 [m, i, i, l, m, u, i, l, i, l] l 40 104 2124 2134
10 1 [., 3] . 203 213 191 198
11 1 [., 3] 3 235 262 131 198
12 2 [A, M, S, U, N] A 275 347 147 239
13 2 [A, M, S, U, N] M 363 465 145 239
14 2 [A, M, S, U, N] S 485 549 145 243
You could use a list comprehension for this:
arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
new_arr = [x for x in arr if x]
Or perhaps you prefer list + filter:
new_arr = list(filter(lambda x: x, arr))
The reason the lambda x: x works in this case is because that particular lambda is testing whether a given x in arr is "truthy." More specifically, that lambda will filter out elements in arr that are "falsey," like an empty list, []. It's almost like saying, "Keep everything in arr that 'exists'," so to speak.
new_list = []
for x in old_list:
    if len(x) > 0:
        new_list.append(x)
You could do this:
lst = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
new_lst = [i for i in lst if len(i) > 0]
