Splitting list of strings in a column of vaex dataframe

There is a vaex dataframe with a column such as:
df['col']
['aa', ' NO']
['aa', ' NO']
['aa', ' NO']
['aa', ' NO']
['aa', ' NO']
I want to convert this one column into two columns, as follows:
df['col1', 'col2']
['aa'], [' NO']
['aa'], [' NO']
['aa'], [' NO']
['aa'], [' NO']
['aa'], [' NO']
Is there any way to do that in Vaex?

I do it like this. It is not very clean, but it works. If you don't know where each string starts or ends, you could use the find method to locate it:
df.head(10)
>>> col
>>> 0 ['aa', 'NO']
>>> 1 ['aa', 'NO']
>>> 2 ['aa', 'NO']
>>> 3 ['aa', 'NO']
>>> 4 ['aa', 'NO']
df['col1'] = [[x[1:5]] for x in df['col']]
df['col2'] = [[x[7:11]] for x in df['col']]
df.head(10)
>>> col col1 col2
>>> 0 ['aa', 'NO'] ['aa'] ['NO']
>>> 1 ['aa', 'NO'] ['aa'] ['NO']
>>> 2 ['aa', 'NO'] ['aa'] ['NO']
>>> 3 ['aa', 'NO'] ['aa'] ['NO']
>>> 4 ['aa', 'NO'] ['aa'] ['NO']
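The fixed slices above only work if every row is the string form of a list with items of the same width. A minimal plain-Python sketch of an alternative, assuming the rows really are strings such as "['aa', ' NO']": parse each row instead of slicing it. The helper name split_row is hypothetical, and applying it to a vaex column is not shown here.

```python
import ast

def split_row(row):
    """Parse a row such as "['aa', ' NO']" and wrap each item in its own list."""
    items = ast.literal_eval(row)  # safely evaluates a Python literal, nothing else
    return [[item] for item in items]

# Assumed sample data: each row is the string representation of a list.
rows = ["['aa', ' NO']", "['aa', ' NO']"]
col1 = [split_row(r)[0] for r in rows]
col2 = [split_row(r)[1] for r in rows]
print(col1)
print(col2)
```

Because the row is parsed rather than sliced, items of different lengths do not need different offsets.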

Related

creating a list of dictionaries from pandas dataframe

This is my df:
df = pd.DataFrame({'sym': ['a', 'b', 'c', 'x', 'y', 'z', 'q', 'w', 'e'],
'sym_t': ['tsla', 'msft', 'f', 'aapl', 'aa', 'gg', 'amd', 'ba', 'c']})
I want to separate this df into groups of three and create a list of dictionaries:
options = [{'value':'a b c', 'label':'tsla msft f'}, {'value':'x y z', 'label':'aapl aa gg'}, {'value':'q w e', 'label':'amd ba c'}]
How can I create that list? My original df has over 1000 rows.

Try groupby to concatenate the rows, then to_dict:
tmp = df.groupby(np.arange(len(df))//3).agg(' '.join)
tmp.columns = ['value', 'label']
tmp.to_dict(orient='records')
Output:
[{'value': 'a b c', 'label': 'tsla msft f'},
{'value': 'x y z', 'label': 'aapl aa gg'},
{'value': 'q w e', 'label': 'amd ba c'}]
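The integer division by 3 is what forms the groups: rows 0 to 2 get key 0, rows 3 to 5 get key 1, and so on. The same idea can be sketched without pandas, assuming the two columns are available as plain lists; rows_to_options is a hypothetical helper name, not part of any library.

```python
def rows_to_options(sym, sym_t, size=3):
    """Join every `size` consecutive entries into one 'value'/'label' dict."""
    return [
        {
            "value": " ".join(sym[i:i + size]),
            "label": " ".join(sym_t[i:i + size]),
        }
        for i in range(0, len(sym), size)  # start index of each group
    ]

sym = ['a', 'b', 'c', 'x', 'y', 'z', 'q', 'w', 'e']
sym_t = ['tsla', 'msft', 'f', 'aapl', 'aa', 'gg', 'amd', 'ba', 'c']
options = rows_to_options(sym, sym_t)
print(options)
```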

dataframe remove last digit from string if it is number

I want to delete the last character of each string if it is a digit, going from the current dataframe
data = {'d':['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', np.nan, '123456']}
df = pd.DataFrame(data)
to this new dataframe:
data = {'d':['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', np.nan, '123456'],
'expected': ['AAA', 'BB', 'C', 'DDD', 'EEEEEEE)', 'FFF (', np.nan, '12345']}
df = pd.DataFrame(data)
df

Using .str.replace:
df['d'] = df['d'].str.replace(r'(\d)$','',regex=True)
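To see what the pattern does on its own, here is a small sketch with the standard re module on a plain list. It assumes the values are ordinary Python strings; drop_trailing_digit is a hypothetical helper. Note that the pattern removes only the final digit, so a space left in front of it (as in 'BB 2') stays.

```python
import re

def drop_trailing_digit(value):
    """Remove one trailing digit from a string; return non-strings unchanged."""
    if not isinstance(value, str):
        return value  # e.g. NaN or None is passed through untouched
    return re.sub(r'\d$', '', value)  # \d$ matches a single digit at the end

values = ['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', '123456']
cleaned = [drop_trailing_digit(v) for v in values]
print(cleaned)
```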

Comparing lists elements to sublist elements in Pandas

df
col1 col2
['aa', 'bb', 'cc', 'dd'] [['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']]
['ss', 'dd', 'ff', 'gg'] [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']]
['ss', 'dd'] [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']]
I'd like to be able to run a function that concats the first list element in col1 to the first sublist elements (there are multiple sublists) in col2, then concats the second list element in col1 to the second sublist elements in col2.
Results would be like this column:
results
[['aaee', 'bbff', 'ccgg', 'ddhh'],['aaqq', 'bbww', 'ccee', 'ddrr']]
[['ssmm', 'ddnn', 'ffvv', 'ggcc'],['sszz', 'ddaa', 'ffjj', 'ggkk']]
[['ssmm', 'ddnn'],['sszz', 'ddaa']]
I'm thinking it would have something to do with looping through the first elements in col1 and somehow loop and match them to the corresponding items in each sublist in col2 - how can I do this?
Converted code
[[[df1.agg(lambda x: get_top_matches(u,w), axis=1) for u,w in zip(x,v)]\
for v in y] for x,y in zip(df1['parent_org_name_list'], df1['children_org_name_sublists'])]
Results:

You can just use zip here:
[[[u+w for u,w in zip(x,v)] for v in y] for x,y in zip(df['col1'], df['col2'])]
Output:
[[['aaee', 'bbff', 'ccgg', 'ddhh'], ['aaqq', 'bbww', 'ccee', 'ddrr']],
[['ssmm', 'ddnn', 'ffvv', 'ggcc'], ['sszz', 'ddaa', 'ffjj', 'ggkk']],
[['ssmm', 'ddnn'], ['sszz', 'ddaa']]]
To assign back to your dataframe, you can do:
df['results'] = [[[u+w for u,w in zip(x,v)] for v in y]
for x,y in zip(df['col1'], df['col2'])]
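The nested comprehension can be tried on plain lists first. This is a sketch assuming each col1 entry is a list of strings and each col2 entry is a list of such lists; concat_pairwise is a hypothetical name. Because zip stops at the shorter input, a two-item col1 row only consumes the first two items of each sublist.

```python
def concat_pairwise(col1, col2):
    """For each row, join the i-th item of col1 with the i-th item of every sublist in col2."""
    return [
        [[u + w for u, w in zip(x, v)] for v in y]  # one output sublist per sublist v
        for x, y in zip(col1, col2)                 # walk the two columns row by row
    ]

col1 = [['aa', 'bb', 'cc', 'dd'], ['ss', 'dd']]
col2 = [
    [['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']],
    [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']],
]
results = concat_pairwise(col1, col2)
print(results)
```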

Max, try this solution with a loop. It allows finer control over the transformation, including handling uneven lengths (see len_limit in the example):
import pandas as pd
df = pd.DataFrame({'c1':[['aa', 'bb', 'cc', 'dd'],['ss', 'dd', 'ff', 'gg']],
'c2':[[['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']],
[['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']]],})
df ['c3'] = 'empty' # send string to 'c3' so it is object data type
print(df)
c1 c2 c3
0 [aa, bb, cc, dd] [[ee, ff, gg, hh], [qq, ww, ee, rr]] empty
1 [ss, dd, ff, gg] [[mm, nn, vv, cc], [zz, aa, jj, kk]] empty
for i, row in df.iterrows():
    c3_list = []
    len_limit = len(row['c1'])  # never read past the length of c1
    for c2_sublist in row['c2']:
        c3_list.append([j1+j2 for j1, j2 in zip(row['c1'], c2_sublist[:len_limit])])
    df.at[i, 'c3'] = c3_list
print(df['c3'])
0 [[aaee, bbff, ccgg, ddhh], [aaqq, bbww, ccee, ...
1 [[ssmm, ddnn, ffvv, ggcc], [sszz, ddaa, ffjj, ...
Name: c3, dtype: object

Try:
df["results"] = df[["col1", "col2"]].apply(lambda x: [list(map(''.join, zip(x["col1"], el))) for el in x["col2"]], axis=1)
Outputs:
>>> df["results"]
0 [[aaee, bbff, ccgg, ddhh], [aaqq, bbww, ccee, ...
1 [[ssmm, ddnn, ffvv, ggcc], [sszz, ddaa, ffjj, ...
2 [[ssmm, ddnn], [sszz, ddaa]]

replace duplicate values in a list with white space

Say I have a sorted list, and I want to keep only the first occurrence of each value, replacing the repeats with a space.
a = ['aa', 'aa', 'aa', 'bb', 'bb', 'cc']
should be converted into
a = ['aa', ' ', ' ', 'bb', ' ', 'cc']
It seems to be a very odd request. The reason behind this is I want a unique label list for my seaborn heatmap for xticklabel. The length of my list is very long (>1000). If I plot every value in my list, the plot will be a disaster.

If the list is sorted, the simplest is to use itertools.groupby to convert every subsequence, then stitch them together:
from itertools import groupby
new_a = [x for k, v in groupby(a) for x in [k] + [' '] * (sum(1 for __ in v) - 1)]
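The one-liner is dense, so here is the same groupby idea unrolled into a loop. It is a sketch that assumes equal values sit next to each other (as in a sorted list); blank_repeats and the filler parameter are hypothetical names.

```python
from itertools import groupby

def blank_repeats(values, filler=' '):
    """Keep the first item of each run of equal values and replace the rest with filler."""
    result = []
    for key, group in groupby(values):      # one (key, run) pair per run of equal items
        run_length = sum(1 for _ in group)  # count the items in this run
        result.append(key)
        result.extend([filler] * (run_length - 1))
    return result

a = ['aa', 'aa', 'aa', 'bb', 'bb', 'cc']
new_a = blank_repeats(a)
print(new_a)
```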

Here's another approach with easier readability.
org = None
a = ['aa', 'aa', 'aa', 'bb', 'bb', 'cc']
for i in range(len(a)):
    if a[i] == org:
        a[i] = " "
    else:
        org = a[i]
print(a)
Output:
['aa', ' ', ' ', 'bb', ' ', 'cc']

One way is to use a Counter:
In [26]: a
Out[26]: ['aa', 'aa', 'aa', 'bb', 'bb', 'cc']
In [27]: from collections import Counter
In [28]: counter = Counter(a)
In [29]: data = []
In [30]: for i in counter:
    ...:     data.append(i)
    ...:     data.extend([" "] * (counter[i] - 1))
    ...:
In [31]: data
Out[31]: ['aa', ' ', ' ', 'bb', ' ', 'cc']

a = ['aa', 'aa', 'aa', 'bb', 'bb', 'cc']
newlist = []
for i in a:
    if i not in newlist:
        newlist.append(i)
    else:
        newlist.append('')
print(newlist)
>> ['aa', '', '', 'bb', '', 'cc']

First, create a new list:
new_a = []
Then append each element the first time it appears, and append a space for every later occurrence:
for i in a:
    if i not in new_a:
        new_a.append(i)
    else:
        new_a.append(" ")
print(new_a)
Output:
>> ['aa', ' ', ' ', 'bb', ' ', 'cc']

How to create sub list with fixed length from given number of inputs or list in Python?

I want to create sub-lists with fixed list length, from given number of inputs in Python.
For example, my inputs are: ['a','b','c',......'z']... Then I want to put those values in several lists. Each list length should be 6. So I want something like this:
first list = ['a','b','c','d','e','f']
second list = ['g','h','i','j','k','l']
last list = [' ',' ',' ',' ',' ','z' ]
How can I achieve this?

The smallest solution:
x = ["a","b","c","d","e","f","g","h","i","j"]
size = 3  # e.g. taken from user input
for counter in range(0, len(x), size):
    print(x[counter:counter+size])

This will split your list into 2 lists of equal length (6):
>>> my_list = [1, 'ab', '', 'No', '', 'NULL', 2, 'bc', '','Yes' ,'' ,'Null']
>>> x = my_list[:len(my_list)//2]
>>> y = my_list[len(my_list)//2:]
>>> x
[1, 'ab', '', 'No', '', 'NULL']
>>> y
[2, 'bc', '', 'Yes', '', 'Null']
If you want to split a list to many smaller lists use:
chunks = [my_list[x:x+size] for x in range(0, len(my_list), size)]
Where size is the size of the smaller lists you want, example:
>>> size = 2
>>> chunks = [my_list[x:x+size] for x in range(0, len(my_list), size)]
>>> chunks
[[1, 'ab'], ['', 'No'], ['', 'NULL'], [2, 'bc'], ['', 'Yes'], ['', 'Null']]
>>> for item in chunks:
...     print(item)
...
[1, 'ab']
['', 'No']
['', 'NULL']
[2, 'bc']
['', 'Yes']
['', 'Null']
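The slicing above leaves a short final chunk as it is. If the last list should be padded to the full length, as the question's last list suggests, one possible sketch follows. It assumes the padding goes on the left and uses ' ' as filler; chunk_padded is a hypothetical helper.

```python
import string

def chunk_padded(items, size, filler=' '):
    """Split items into lists of `size`; left-pad the final list with filler if it is short."""
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    if chunks and len(chunks[-1]) < size:
        last = chunks[-1]
        chunks[-1] = [filler] * (size - len(last)) + last
    return chunks

letters = list(string.ascii_lowercase)  # 'a' through 'z', 26 items
chunks = chunk_padded(letters, 6)
for chunk in chunks:
    print(chunk)
```

With 26 letters and a size of 6, the first four lists are full and the fifth holds 'y' and 'z' after four filler entries.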

Your input is a string, and you need to split it first by comma, and then divide it further:
input_string = "1, 'ab', '', 'No', '', 'NULL', 2, 'bc', '','Yes' ,'' ,'Null'"
bits = input_string.split(',')
x,y = bits[:6],bits[6:] # divide by 6
x,y = bits[:len(bits)//2],bits[len(bits)//2:] # divide in half

This builds a 2D list b in which each sublist holds up to chunksize entries; the last sublist holds whatever is left over.
import math
a = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
b = []
chunksize = 6
def get_list(a, chunk):
    return a[chunk*chunksize:chunk*chunksize+chunksize]
# round up so the final partial chunk is not dropped
for i in range(math.ceil(len(a) / chunksize)):
    b.append(get_list(a, i))
print(b)
Output:
[['a', 'b', 'c', 'd', 'e', 'f'], ['g', 'h', 'i', 'j', 'k', 'l'], ['m', 'n', 'o', 'p', 'q', 'r'], ['s', 't', 'u', 'v', 'w', 'x'], ['y', 'z']]
