Remove empty rows and empty [ ] using Python
I have 10000 rows in my CSV file. I want to remove the empty brackets [] and the rows which are entirely empty [[]].
For instance, the first cell in the first column:
[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
needs to be transformed into:
[['1', 2364, 2382, 1552, 1585],['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
and the row with only empty brackets:
[[]] [[]]
needs to be removed from the file.
I tried:
df1 = df.Column_1.str.strip([]).str.split(',', expand=True)
My data are strings:
print(type(df.loc[0,'Column_1']))
<class 'str'>
print(type(df.loc[0,'Column_2']))
<class 'str'>
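Since each cell is a string rather than a list, one way to sanity-check the cleaning logic on a single value is to parse it first with ast.literal_eval (the same approach the answer below ends up using); a minimal sketch on the sample cell above:

import ast

# the first cell from above, stored as a string in the CSV
cell = "[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]"

parsed = ast.literal_eval(cell)           # -> a real nested list
cleaned = [sub for sub in parsed if sub]  # drop the empty [] sublists
print(cleaned)
# [['1', 2364, 2382, 1552, 1585], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]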
EDIT1
After executing the following code:
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
df1 = df1.dropna()
it solves the problem. However, I got an issue with the comma (as a character, not a delimiter) ',' in the resulting lines. I wanted to create a new CSV file as follows:
columns =['char', 'left', 'right', 'top', 'down']
which corresponds, for instance, to:
'1' 2364 2382 1552 1585
to get a CSV file as follows:
char left top right bottom
0 'm' 38 104 2456 2492
1 'i' 40 102 2442 222
2 '.' 203 213 191 198
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
So the whole code to get this is:
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
df1 = df1.dropna()
cols = ['char','left','right','top','bottom']
df1 = df.positionlrtb.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left']=df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top']=df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right']=df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom']=df1['bottom'].replace(['\[','\]'], ['',''], regex=True)
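As an aside, the five per-column replace calls above perform the same substitution; they could be collapsed into a single frame-wide call, a sketch assuming df1 still holds strings at that point:

# one regex replace over the whole frame instead of five per-column calls
df1 = df1.replace(r'[\[\]]', '', regex=True)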
However, after running that code, I don't find any ',' character in my file, which shifts the fields in the new CSV file. Rather than getting:
',' 1491 1494 172 181
I get no comma ',' at all, and the misalignment shows in the following two lines:
' ' 1491 1494 172
181 'r' 1508 1517 159
It should be:
',' 1491 1494 172 181
'r' 1508 1517 159 ... and so on
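The misalignment comes from splitting the raw string on ',' while ',' is itself one of the recognized characters; a small sketch of the failure and of how ast.literal_eval avoids it (the row value here is hypothetical, shaped like the data above):

import ast

row = "[',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 172]"

# splitting on ',' also splits inside the quoted comma character,
# shifting every following field by one position
print(row.strip('[]').split(','))
# ["'", "'", ' 1491', ' 1494', ' 172', ' 181', " 'r'", ' 1508', ...]

# parsing instead of splitting keeps ',' as a single field
print(ast.literal_eval(row))
# [',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 172]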
EDIT2
I'm trying to add two other columns called line_number and all_chars_in_same_row:
1) line_number corresponds to the line from which, for example,
'm' 38 104 2456 2492
is extracted, let's say from line 2.
2) all_chars_in_same_row corresponds to all the (spaced) characters which are in the same row. For instance, given
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]
I get '1' '8' '4' '1' '7' and so on.
More formally, all_chars_in_same_row means: write all the characters of the row given in the line_number column.
char left top right bottom line_number all_chars_in_same_row
0 'm' 38 104 2456 2492 from line 2 'm' '2' '5' 'g'
1 'i' 40 102 2442 222 from line 4
2 '.' 203 213 191 198 from line 6
3 '3' 235 262 131 3333
4 'A' 275 347 147 239
5 'M' 363 465 145 3334
6 'A' 73 91 373 394
7 'D' 93 112 373 39
8 'D' 454 473 663 685
9 'O' 474 495 664 33
10 'A' 108 129 727 751
11 'V' 129 150 727 444
The code related to that is:
import pandas as pd
df_data=pd.read_csv('see2.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
x=len(df_data.columns) #get total number of columns
#get all characters from every 5th column, concatenate and create new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda x: ','.join(x.dropna()), axis=1)
# get index of each row. This is the line number for your record
df_data[x+1]=df_data.index.get_level_values(0)
# now set line number and character columns as Index of data frame
df_data.set_index([x+1,x],inplace=True,drop=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]
df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1) #assign character values to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns=cols
df_data.reset_index(inplace=True) #remove mutiindexing
print(df_data[cols])
and the output:
char left top right bottom from line all_chars_in_same_row
0 '.' 203 213 191 198 0 ['.', '3', 'C']
1 '3' 1758 1775 370 391 0 ['.', '3', 'C']
2 'C' 296 305 1492 1516 0 ['.', '3', 'C']
3 'A' 275 347 147 239 1 ['A', 'M', 'D']
4 'M' 2166 2184 370 391 1 ['A', 'M', 'D']
5 'D' 339 362 1815 1840 1 ['A', 'M', 'D']
6 'A' 73 91 373 394 2 ['A', 'D', 'A']
7 'D' 1395 1415 427 454 2 ['A', 'D', 'A']
8 'A' 1440 1455 2047 2073 2 ['A', 'D', 'A']
9 'D' 454 473 663 685 3 ['D', 'O', '0']
10 'O' 1533 1545 487 541 3 ['D', 'O', '0']
11 '0' 339 360 2137 2163 3 ['D', 'O', '0']
12 'A' 108 129 727 751 4 ['A', 'V', 'I']
13 'V' 1659 1677 490 514 4 ['A', 'V', 'I']
14 'I' 339 360 1860 1885 4 ['A', 'V', 'I']
15 'N' 34 51 949 970 5 ['N', '/', '2']
16 '/' 1890 1904 486 505 5 ['N', '/', '2']
17 '2' 1266 1283 1951 1977 5 ['N', '/', '2']
18 'S' 1368 1401 43 85 6 ['S', 'A', '8']
19 'A' 1344 1361 583 607 6 ['S', 'A', '8']
20 '8' 2207 2217 1492 1515 6 ['S', 'A', '8']
21 'S' 1437 1457 112 138 7 ['S', 'o', 'O']
22 'o' 1548 1580 979 1015 7 ['S', 'o', 'O']
23 'O' 1331 1349 370 391 7 ['S', 'o', 'O']
24 'h' 1686 1703 315 339 8 ['h', 't', 't']
25 't' 169 190 1291 1312 8 ['h', 't', 't']
26 't' 169 190 1291 1312 8 ['h', 't', 't']
27 'N' 1331 1349 370 391 9 ['N', 'C', 'C']
28 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
29 'C' 296 305 1492 1516 9 ['N', 'C', 'C']
However, I got strange results (order of letters, numbers, columns, headers...). I can't share them; the file is too long. I tried to share it, but it exceeds the maximum number of characters.
Also, this line of code
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
returns None values:
0 1 2 3 4 5 6 7 8 9 ... \
0 'm' 38 104 2456 2492 'i' 40 102 2442 2448 ...
1 '.' 203 213 191 198 '3' 235 262 131 198 ...
2 'A' 275 347 147 239 'M' 363 465 145 239 ...
3 'A' 73 91 373 394 'D' 93 112 373 396 ...
4 'D' 454 473 663 685 'O' 474 495 664 687 ...
5 'A' 108 129 727 751 'V' 129 150 727 753 ...
6 'N' 34 51 949 970 '/' 52 61 948 970 ...
7 'S' 1368 1401 43 85 'A' 1406 1446 43 85 ...
8 'S' 1437 1457 112 138 'o' 1458 1476 118 138 ...
9 'h' 1686 1703 315 339 't' 1706 1715 316 339 ...
1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
0 None None None None None None None None None None
1 None None None None None None None None None None
2 None None None None None None None None None None
3 None None None None None None None None None None
4 None None None None None None None None None None
5 None None None None None None None None None None
6 None None None None None None None None None None
EDIT3
However, when I add page_number along with character_position:
df1 = pd.DataFrame({
"from_line": np.repeat(df.index.values, df.character_position.str.len()),
"b": list(chain.from_iterable(df.character_position)),
"page_number" : np.repeat(df.index.values,df['page_number'])
})
I got the following error:
File "/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
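The traceback points at np.repeat: its second argument must be integer repeat counts, but df['page_number'] holds strings such as '1841729699_001', so NumPy refuses the cast. A sketch of the likely intended call, repeating the page numbers themselves by the same per-row list lengths used for the index (mirroring the pattern in the answer's EDIT1 below):

df1 = pd.DataFrame({
    "from_line": np.repeat(df.index.values, df.character_position.str.len()),
    "b": list(chain.from_iterable(df.character_position)),
    "page_number": np.repeat(df.page_number.values, df.character_position.str.len())
})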
For lists you can use applymap with a list comprehension to remove the [] first, and then remove all rows with boolean indexing, where the mask checks that all values in a row have non-zero length (no empty lists).
df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
If you need to remove a row when any value is [[]]:
df1 = df1[~df1.applymap(len).eq(0).any(axis=1)]
If values are strings:
df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)
and then dropna:
df1 = df1.dropna(how='all')
Or:
df1 = df1.dropna()
EDIT1:
import ast

df = pd.read_csv('see2.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
page_number positionlrtb \
0 1841729699_001 [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...
1 1841729699_001 [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]
2 1841729699_001 [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...
3 1841729699_001 [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...
4 1841729699_001 [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...
LineIndex
0 [[mi, il, mu, il, il]]
1 [[.3]]
2 [[amsun]]
3 [[adresse, de, livraison]]
4 [[document]]
import numpy as np
from itertools import chain

df1 = pd.DataFrame({
    "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
    "b": list(chain.from_iterable(df.positionlrtb))})
df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
cols = ['char','left','top','right','bottom']
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
char left top right bottom
0 m 38 104 2456 2492
1 i 40 102 2442 2448
2 i 40 100 2402 2410
3 l 40 102 2372 2382
4 m 40 102 2312 2358
5 u 40 102 2292 2310
6 i 40 104 2210 2260
7 l 40 104 2180 2208
8 i 40 104 2140 2166
EDIT2:
#skip first row
df = pd.read_csv('see2.csv', usecols=[2], names=['character_position'], skiprows=1)
print (df.head())
character_position
0 [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1 [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2 [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3 [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4 [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
#convert to list, remove empty lists
df.character_position = df.character_position.apply(ast.literal_eval)
df.character_position = df.character_position.apply(lambda x: [y for y in x if len(y) > 0])
#new df - http://stackoverflow.com/a/42788093/2901002
df1 = pd.DataFrame({
"from line": np.repeat(df.index.values, df.character_position.str.len()),
"b": list(chain.from_iterable(df.character_position))})
#filter by list comprehension string only, convert to tuple, because need create index
df1['all_chars_in_same_row'] = df1['b'].apply(lambda x: tuple([y for y in x if isinstance(y, str)]))
df1 = df1.set_index(['from line','all_chars_in_same_row'])
#new df from column b
df1 = pd.DataFrame(df1.b.values.tolist(), index=df1.index)
#Multiindex in columns
df1.columns = [df1.columns % 5, df1.columns // 5]
#reshape
df1 = df1.stack().reset_index(level=2, drop=True)
cols = ['char','left','top','right','bottom']
df1.columns = cols
#convert last columns to int
df1[cols[1:]] = df1[cols[1:]].astype(int)
df1 = df1.reset_index()
#convert tuples to list
df1['all_chars_in_same_row'] = df1['all_chars_in_same_row'].apply(list)
print (df1.head(15))
from line all_chars_in_same_row char left top right bottom
0 0 [m, i, i, l, m, u, i, l, i, l] m 38 104 2456 2492
1 0 [m, i, i, l, m, u, i, l, i, l] i 40 102 2442 2448
2 0 [m, i, i, l, m, u, i, l, i, l] i 40 100 2402 2410
3 0 [m, i, i, l, m, u, i, l, i, l] l 40 102 2372 2382
4 0 [m, i, i, l, m, u, i, l, i, l] m 40 102 2312 2358
5 0 [m, i, i, l, m, u, i, l, i, l] u 40 102 2292 2310
6 0 [m, i, i, l, m, u, i, l, i, l] i 40 104 2210 2260
7 0 [m, i, i, l, m, u, i, l, i, l] l 40 104 2180 2208
8 0 [m, i, i, l, m, u, i, l, i, l] i 40 104 2140 2166
9 0 [m, i, i, l, m, u, i, l, i, l] l 40 104 2124 2134
10 1 [., 3] . 203 213 191 198
11 1 [., 3] 3 235 262 131 198
12 2 [A, M, S, U, N] A 275 347 147 239
13 2 [A, M, S, U, N] M 363 465 145 239
14 2 [A, M, S, U, N] S 485 549 145 243
You could use a list comprehension for this:
arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
new_arr = [x for x in arr if x]
Or perhaps you prefer list + filter:
new_arr = list(filter(lambda x: x, arr))
The reason lambda x: x works in this case is that the lambda tests whether a given x in arr is "truthy." More specifically, it filters out the elements of arr that are "falsey," like an empty list []. It's almost like saying, "Keep everything in arr that 'exists'," so to speak.
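A quick illustration of that truthiness test:

bool([])                   # False - an empty list is "falsey"
bool(['1', 2364, 2382])    # True  - a non-empty list is "truthy"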
Or, written as an explicit loop:
new_list = []
for x in old_list:
    if len(x) > 0:
        new_list.append(x)
You could do this:
lst = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
new_lst = [i for i in lst if len(i) > 0]
Related
How to create files from a groupby object, based on the length of the dataframe
I have a dataframe (df) that looks like this (highly simplified):

ID  A   B    C     VALUE
1   10  462  2241  217
2   11  498  6953  217
3   67  120  6926  654
4   68  898  7153  654
5   87  557  4996  654
6   88  227  6475  911
7   47  875  5097  911
8   48  143  8953  111
9   65  157  4470  111
10  66  525  9328  111

The 'VALUE' column contains a variable number of rows with identical values. I am trying to output a series of csv files that contain all of the rows that contain a 'VALUE' length == 2, == 3, etc. For example:

to_csv('/Path/to/VALUE_len_2.csv')

ID  A   B    C     VALUE
1   10  462  2241  217
2   11  498  6953  217
6   88  227  6475  911
7   47  875  5097  911

to_csv('/Path/to/VALUE_len_3.csv')

ID  A   B    C     VALUE
3   67  120  6926  654
4   68  898  7153  654
5   87  557  4996  654

to_csv('/Path/to/VALUE_len_4.csv')

ID  A   B    C     VALUE
7   47  875  5097  111
8   48  143  8953  111
9   65  157  4470  111
10  66  525  9328  111

I can get the desired output of one length value at a time, e.g., using:

df = pd.concat(v for _, v in df.groupby("VALUE") if len(v) == 2)
df.to_csv("/Path/to/VALUE_len_2.csv")

However, I have dozens of values to test. I would like to put this in a for loop on the order of:

mylist = [2,3,4,5,6,7,8,9] or len([2,3,4,5,6,7,8,9])
grouped = df.groupby(['VALUE'])
output = '/Path/to/VALUE_len_{}.csv'
for loop here:
    if item in my list found in grouped:
        output rows to csv
    else:
        pass

I've tried various constructions to iterate the groupby object using the items in the list, and I haven't been able to make anything work. It might be an issue with trying to use a groupby object this way, but it is more than likely my inability to get the syntax right to complete the iteration.
It doesn't make sense to use a predetermined list to create the filenames. df_len will be used to generate a filename using an f-string, and Path.exists() is used to determine if the file exists or not.

import pandas as pd
from pathlib import Path

# test data
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'A': [10, 11, 67, 68, 87, 88, 47, 48, 65, 66],
        'B': [462, 498, 120, 898, 557, 227, 875, 143, 157, 525],
        'C': [2241, 6953, 6926, 7153, 4996, 6475, 5097, 8953, 4470, 9328],
        'VALUE': [217, 217, 654, 654, 654, 911, 911, 111, 111, 111]}
df = pd.DataFrame(data)

# groupby value
for group, data in df.groupby('VALUE'):
    # get the length of the dataframe
    df_len = len(data)
    # create a filename with df_len
    file = Path(f'/path/to/VALUE_len_{df_len}.csv')
    # if the file exists, append without the header
    if file.exists():
        data.to_csv(file, index=False, mode='a', header=False)
    # otherwise create a new file
    else:
        data.to_csv(file, index=False)

If you must only create a file for dataframes of a specific length:

desired_length = [2, 3, 4, 5, 6, 7, 8, 9]

# groupby value
for group, data in df.groupby('VALUE'):
    # get the length of the dataframe
    df_len = len(data)
    # create a filename with df_len
    file = Path(f'/path/to/VALUE_len_{df_len}.csv')
    # check if the length of the dataframe is in the desired lengths
    if df_len in desired_length:
        # if the file exists, append without the header
        if file.exists():
            data.to_csv(file, index=False, mode='a', header=False)
        # otherwise create a new file
        else:
            data.to_csv(file, index=False)
# find the VALUE_len of every VALUE first
df['VALUE_len'] = df.groupby('VALUE')['ID'].transform('count')
cols = df.columns[:-1]  # ['ID', 'A', 'B', 'C', 'VALUE']

# save each group by VALUE_len
for VALUE_len, group in df.groupby('VALUE_len'):
    file = Path(f'/path/to/VALUE_len_{VALUE_len}.csv')
    group[cols].to_csv(file, index=False)
Create new column with various conditional logic between other columns
I have the following dataset:

test = pd.DataFrame({'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2019-09-01', '2019-09-02', '2019-09-03', '2020-01-02', '2020-01-03', '2020-01-04', '2020-10-04', '2020-10-05'],
                     'account': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
                     'tot_chg': [2072, 2072, 2072, 322, 322, 322, 483, 483, 483, 140, 570],
                     'denied': [1878, 1036, 1036, 322, 161, 161, 150, 322, 322, 105, 570],
                     'denied_sum': [1878, 2914, 3950, 322, 483, 644, 150, 472, 794, 105, 570]})

to which I would like to append a new column called denied_true based on the following parameters:

- while denied_sum is less than tot_chg, return denied
- once denied_sum exceeds tot_chg, compute the remaining difference between tot_chg and the sum of all prior denied_true values
- and if denied equals tot_chg at the first instance, just return denied and make the remaining rows for the account 0

The dataframe for the desired output is:

output = pd.DataFrame({'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2019-09-01', '2019-09-02', '2019-09-03', '2020-01-02', '2020-01-03', '2020-01-04', '2020-10-04', '2020-10-05'],
                       'account': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
                       'tot_chg': [2072, 2072, 2072, 322, 322, 322, 483, 483, 483, 140, 570],
                       'denied': [1878, 1036, 1036, 322, 161, 161, 150, 322, 322, 105, 570],
                       'denied_sum': [1878, 2914, 3950, 322, 483, 644, 150, 472, 794, 105, 570],
                       'denied_true': [1878, 194, 0, 322, 0, 0, 150, 322, 11, 105, 570]})

So far, I have tried the following code using where, but it's missing the condition of subtracting the previous denied_true value from tot_chg:

test['denied_true'] = test.denied_sum.to_numpy()
test.denied_true.where(test.denied_sum.le(test.tot_chg), other=0, inplace=True)
test

However, I'm not really sure how to append multiple conditions to this where function. Maybe I need if/elif loops, or a boolean mask. Any help would be greatly appreciated!
You can convert the DataFrame into an OrderedDict and handle it this straightforward way:

import pandas as pd
from collections import OrderedDict

test = pd.DataFrame({'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2019-09-01', '2019-09-02', '2019-09-03', '2020-01-02', '2020-01-03', '2020-01-04', '2020-10-04', '2020-10-05'],
                     'account': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
                     'tot_chg': [2072, 2072, 2072, 322, 322, 322, 483, 483, 483, 140, 570],
                     'denied': [1878, 1036, 1036, 322, 161, 161, 150, 322, 322, 105, 570],
                     'denied_sum': [1878, 2914, 3950, 322, 483, 644, 150, 472, 794, 105, 570]})

# convert DataFrame into OrderedDict
od = test.to_dict(into=OrderedDict)

# functions (samples)
def zero(dict, row):
    # if denied == denied_sum
    # change the dict...
    return dict['denied'][row]

def ex(dict, row):
    # if exceeds
    # change the dict...
    return 'exceed()'

def eq(dict, row):
    # if equals
    # change the dict...
    return 'equal()'

def get_value(dict, row):
    # conditions
    if dict['denied'][row] == dict['denied_sum'][row]:
        return zero(dict, row)
    if dict['denied_sum'][row] < dict['tot_chg'][row]:
        return dict['denied'][row]
    if dict['denied_sum'][row] > dict['tot_chg'][row]:
        return ex(dict, row)
    if dict['denied_sum'][row] == dict['tot_chg'][row]:
        return eq(dict, row)

# MAIN
# make a list (column) of 'denied_true' values
denied_true_list = [(row, get_value(od, row)) for row in range(len(od["date"]))]
# convert the list into a dict
denied_true_dict = {'denied_true': OrderedDict(denied_true_list)}
# add the dict to the OrderedDict
od.update(OrderedDict(denied_true_dict))
# convert the OrderedDict back into DataFrame
test = pd.DataFrame(od)

Input:

          date account  tot_chg  denied  denied_sum
0   2018-08-01       a     2072    1878        1878
1   2018-08-02       a     2072    1036        2914
2   2018-08-03       a     2072    1036        3950
3   2019-09-01       b      322     322         322
4   2019-09-02       b      322     161         483
5   2019-09-03       b      322     161         644
6   2020-01-02       c      483     150         150
7   2020-01-03       c      483     322         472
8   2020-01-04       c      483     322         794
9   2020-10-04       d      140     105         105
10  2020-10-05       e      570     570         570

Output:

          date account  tot_chg  denied  denied_sum denied_true
0   2018-08-01       a     2072    1878        1878        1878
1   2018-08-02       a     2072    1036        2914    exceed()
2   2018-08-03       a     2072    1036        3950    exceed()
3   2019-09-01       b      322     322         322         322
4   2019-09-02       b      322     161         483    exceed()
5   2019-09-03       b      322     161         644    exceed()
6   2020-01-02       c      483     150         150         150
7   2020-01-03       c      483     322         472         322
8   2020-01-04       c      483     322         794    exceed()
9   2020-10-04       d      140     105         105         105
10  2020-10-05       e      570     570         570         570

I didn't make a full implementation of your logic in the functions since it's just a sample. About the same (probably it would be a bit easier) can be done via DataFrame > JSON > DataFrame.

Update. I've tried to implement the function ex(). Here is how it might look:

def ex(dict, row):
    # if exceeds
    denied_true_slice = denied_true_list[0:row]  # <-- global list
    tot_chg_slice = [dict['tot_chg'][r] for r in range(row)]
    denied_true_sum = sum([v for r, v in enumerate(denied_true_slice) if tot_chg_slice[r] > v])
    value = tot_chg_slice[-1] - denied_true_sum
    return value if value > 0 else 0

I'm not quite sure it works as supposed, since I don't fully understand the quirky conditions. And I'm sure it looks rather ugly and cryptic, and probably isn't in line with Stack Overflow's best examples.
Now there is the global list, so the MAIN section looks like this:

# MAIN
# make a list (column) of 'denied_true' values
denied_true_list = []  # <-- the global list
for row, _ in enumerate(od['date']):
    denied_true_list.append(get_value(od, row))
denied_true_list = [(row, value) for row, value in enumerate(denied_true_list)]
# convert the list into a dict
denied_true_dict = {'denied_true': OrderedDict(denied_true_list)}
# add the dict to the OrderedDict
od.update(OrderedDict(denied_true_dict))
# convert the OrderedDict back into DataFrame
test = pd.DataFrame(od)

Output:

          date account  tot_chg  denied  denied_sum  denied_true
0   2018-08-01       a     2072    1878        1878         1878
1   2018-08-02       a     2072    1036        2914          194
2   2018-08-03       a     2072    1036        3950            0
3   2019-09-01       b      322     322         322          322
4   2019-09-02       b      322     161         483            0
5   2019-09-03       b      322     161         644            0
6   2020-01-02       c      483     150         150          150
7   2020-01-03       c      483     322         472          322
8   2020-01-04       c      483     322         794            0
9   2020-10-04       d      140     105         105          105
10  2020-10-05       e      570     570         570          570

I believe it could be done much more prettily via native pandas tools.
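As a follow-up to that last remark, here is one possible vectorized sketch with native pandas, assuming (as the sample data shows) that denied_sum is the running per-account total of denied: cap the running total at tot_chg, then take within-account differences.

# cap the running denied total at tot_chg, then diff within each account
capped = test[['denied_sum', 'tot_chg']].min(axis=1)
test['denied_true'] = capped - capped.groupby(test['account']).shift(fill_value=0)

On the sample data this reproduces the denied_true column requested in the question (1878, 194, 0, 322, 0, 0, 150, 322, 11, 105, 570).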
Plotting select rows and columns of DataFrame (python)
I have searched around but have not found an exact way to do this. I have a data frame of several baseball teams and their statistics, like the following:

RK  TEAM          GP   AB    R    H     2B   3B  HR   TB    RBI  AVG   OBP   SLG   OPS
1   Milwaukee     163  5542  754  1398  252  24  218  2352  711  .252  .323  .424  .747
2   Chicago Cubs  163  5624  761  1453  286  34  167  2308  722  .258  .333  .410  .744
3   LA Dodgers    163  5572  804  1394  296  33  235  2461  756  .250  .333  .442  .774
4   Colorado      163  5541  780  1418  280  42  210  2412  748  .256  .322  .435  .757
5   Baltimore     162  5507  622  1317  242  15  188  2153  593  .239  .298  .391  .689

I want to be able to plot two teams on the X-axis and then perhaps 3 metrics (ex: R, H, TB) on the Y-axis, with the two teams side by side in bar chart format. I haven't been able to figure out how to do this. Any ideas? Thank you.
My approach was to create a new dataframe which only contains the columns you are interested in plotting:

import pandas as pd
import matplotlib.pyplot as plt

data = [[1, 'Milwaukee', 163, 5542, 754, 1398, 252, 24, 218, 2352, 711, .252, .323, .424, .747],
        [2, 'Chicago Cubs', 163, 5624, 761, 1453, 286, 34, 167, 2308, 722, .258, .333, .410, .744],
        [3, 'LA Dodgers', 163, 5572, 804, 1394, 296, 33, 235, 2461, 756, .250, .333, .442, .774],
        [4, 'Colorado', 163, 5541, 780, 1418, 280, 42, 210, 2412, 748, .256, .322, .435, .757],
        [5, 'Baltimore', 162, 5507, 622, 1317, 242, 15, 188, 2153, 593, .239, .298, .391, .689]]
df = pd.DataFrame(data, columns=['RK', 'TEAM', 'GP', 'AB', 'R', 'H', '2B', '3B', 'HR', 'TB', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS'])

dfplot = df[['TEAM', 'R', 'H', 'TB']].copy()

fig = plt.figure()
ax = fig.add_subplot(111)
width = 0.4
dfplot.plot(kind='bar', x='TEAM', ax=ax, width=width, position=1)
plt.show()

This creates the following output:
How to print the n-th elements of arrays with matching first items?
I am working in Python and I have the following data:

['DDX58_HUMAN', 'gnl|CDD|256537', '819', '923']
['DDX58_HUMAN', 'gnl|CDD|260076', '111', '189']
['DDX58_HUMAN', 'gnl|CDD|260076', '4', '93']
['DDX58_HUMAN', 'gnl|CDD|238005', '258', '410']
['DDX58_HUMAN', 'gnl|CDD|238034', '606', '741']
['DICER_HUMAN', 'gnl|CDD|239209', '886', '1008']
['DICER_HUMAN', 'gnl|CDD|238333', '1681', '1846']
['DICER_HUMAN', 'gnl|CDD|238333', '1296', '1376']
['DICER_HUMAN', 'gnl|CDD|238333', '1547', '1583']
['DICER_HUMAN', 'gnl|CDD|251903', '630', '722']
['DICER_HUMAN', 'gnl|CDD|238005', '58', '209']
['DICER_HUMAN', 'gnl|CDD|238034', '444', '553']

I need to print the 2nd, 3rd, and 4th items after matching first items, like this:

DDX58_HUMAN
gnl|CDD|256537 819 923
gnl|CDD|260076 111 189
gnl|CDD|260076 4 93
gnl|CDD|238005 258 410
gnl|CDD|238034 606 741
DICER_HUMAN
gnl|CDD|239209 886 1008
gnl|CDD|238333 1681 1846
gnl|CDD|238333 1296 1376
gnl|CDD|238333 1547 1583
gnl|CDD|251903 630 722
gnl|CDD|238005 58 209
gnl|CDD|238034 444 553

How can I achieve this?
Here is sample code for what you want to do. I assume you have this data in Python lists: you can just traverse each list and store values in a dictionary keyed on the first element of the list, and you will be able to get the unique entries.

mylist = [['DDX58_HUMAN', 'gnl|CDD|256537', '819', '923'],
          ['DDX58_HUMAN', 'gnl|CDD|260076', '111', '189'],
          ['DDX58_HUMAN', 'gnl|CDD|260076', '4', '93'],
          ['DDX58_HUMAN', 'gnl|CDD|238005', '258', '410'],
          ['DDX58_HUMAN', 'gnl|CDD|238034', '606', '741'],
          ['DICER_HUMAN', 'gnl|CDD|239209', '886', '1008'],
          ['DICER_HUMAN', 'gnl|CDD|238333', '1681', '1846'],
          ['DICER_HUMAN', 'gnl|CDD|238333', '1296', '1376'],
          ['DICER_HUMAN', 'gnl|CDD|238333', '1547', '1583'],
          ['DICER_HUMAN', 'gnl|CDD|251903', '630', '722'],
          ['DICER_HUMAN', 'gnl|CDD|238005', '58', '209'],
          ['DICER_HUMAN', 'gnl|CDD|238034', '444', '553']]

myDict = {}
for items in mylist:
    myDict.setdefault(items[0], []).append(" ".join(x for x in items[1:]))

for k, v in myDict.items():
    print(k, " : ", " ".join(x for x in v))

Output:

DDX58_HUMAN  :  gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260076 4 93 gnl|CDD|238005 258 410 gnl|CDD|238034 606 741
DICER_HUMAN  :  gnl|CDD|239209 886 1008 gnl|CDD|238333 1681 1846 gnl|CDD|238333 1296 1376 gnl|CDD|238333 1547 1583 gnl|CDD|251903 630 722 gnl|CDD|238005 58 209 gnl|CDD|238034 444 553

If your data is in a .txt file, just read the text file, remove the unwanted braces using the re module, and the same logic will work:

import re

with open("data.txt") as mylist:
    mainList = []
    for items in mylist.readlines():
        dataString = re.sub(r"[\[[\]]", "", items.rstrip()).split(",")
        mainList.append(dataString)

myDict = {}
for items in mainList:
    myDict.setdefault(items[0], []).append("".join(x for x in items[1:]))

for k, v in myDict.items():
    print(k, " : ", "".join(x for x in v))

Output:

'DICER_HUMAN'  :  'gnl|CDD|239209' '886' '1008' 'gnl|CDD|238333' '1681' '1846' 'gnl|CDD|238333' '1296' '1376' 'gnl|CDD|238333' '1547' '1583' 'gnl|CDD|251903' '630' '722' 'gnl|CDD|238005' '58' '209' 'gnl|CDD|238034' '444' '553'
'DDX58_HUMAN'  :  'gnl|CDD|256537' '819' '923' 'gnl|CDD|260076' '111' '189' 'gnl|CDD|260076' '4' '93' 'gnl|CDD|238005' '258' '410' 'gnl|CDD|238034' '606' '741'
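If the rows are already adjacent by their first item, as in the sample data, itertools.groupby offers a shorter sketch (mylist as defined above; note groupby only merges consecutive rows, so unsorted data would need sorting first):

from itertools import groupby
from operator import itemgetter

# group consecutive rows sharing the same first item
for key, rows in groupby(mylist, key=itemgetter(0)):
    print(key)
    for row in rows:
        print(' '.join(row[1:]))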
How to convert from String to List/Array?
I have 3 strings:

a = 38 186 298 345 0.93345
    27 198 277 389 0.86006
    33 127 293 354 0.89782

type(a) shows str, and len(a) shows 22, including the spaces between numbers. I want to convert them to a list, as below:

b = [[38 186 298 345 0.93345][27 198 277 389 0.86006][33 127 293 354 0.89782]]
Is this what you aim for:

a = '''38 186 298 345 0.93345
27 198 277 389 0.86006
33 127 293 354 0.89782'''

b = [line.split() for line in a.split('\n')]
b
#[['38', '186', '298', '345', '0.93345'],
# ['27', '198', '277', '389', '0.86006'],
# ['33', '127', '293', '354', '0.89782']]
Split them by newlines and spaces, using Python's string split method, string_name.split(str=""). For more info: https://www.tutorialspoint.com/python/string_split.htm
This is one solution specific to your data. Note your inputs are not valid Python; I have resolved that below.

a1 = '38 186 298 345 0.93345'
a2 = '27 198 277 389 0.86006'
a3 = '33 127 293 354 0.89782'

res = [[float(j) if float(j) < 1 else int(j) for j in i.split()]
       for i in [a1, a2, a3]]

# [[38, 186, 298, 345, 0.93345],
#  [27, 198, 277, 389, 0.86006],
#  [33, 127, 293, 354, 0.89782]]