Expand a column of lists into rows of strings - Python

I have a dataset that looks like this:

id keyPhrases
0  [word1, word2]
1  [word4, word 5 and 6, word7]
2  [word8, etc, etc]

Each value in 'keyPhrases' is a list. I'd like to expand each list into new rows (one string per row). The 'id' column is not important right now. I've already tried df.values, from_records, etc.
Expected:

keyPhrases
word1
word2
word3
word4
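For reference, recent pandas versions (0.25+) have a built-in `DataFrame.explode` that does exactly this kind of list-to-rows expansion (sample data stands in for the question's dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1],
    'keyPhrases': [['word1', 'word2'], ['word4', 'word5', 'word7']],
})

# explode() gives each list element its own row, repeating the 'id' value
exploded = df.explode('keyPhrases')
print(exploded['keyPhrases'].tolist())
# → ['word1', 'word2', 'word4', 'word5', 'word7']
```

`reset_index(drop=True)` can be chained on if a clean 0..n index is wanted afterwards.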

You can use itertools.chain in combination with dataframe column selection:

import itertools
import pandas as pd

df = pd.DataFrame({
    'keyPhrases': [
        ['word1', 'word2'],
        ['word4', 'word5', 'word7'],
        ['word8', 'word9']
    ],
    'id': [1, 2, 3]
})

for elem in itertools.chain.from_iterable(df['keyPhrases'].values):
    print(elem)
will print:
word1
word2
word4
word5
word7
word8
word9

np.concatenate()

import numpy as np

np.concatenate(df.keyPhrases)  # data courtesy vurmux
# array(['word1', 'word2', 'word4', 'word5', 'word7', 'word8', 'word9'],
#       dtype='<U5')
Another way:
import functools
import operator
functools.reduce(operator.iadd, df.keyPhrases, [])
#['word1', 'word2', 'word4', 'word5', 'word7', 'word8', 'word9']

A fun way, but not recommended (sum concatenates the lists pairwise, which is quadratic in the total number of elements):
df.keyPhrases.sum()
Out[520]: ['word1', 'word2', 'word4', 'word5', 'word7', 'word8', 'word9']

from functools import reduce  # Python 3: reduce lives in functools

keyPhrases = df.keyPhrases.tolist()
reduce(lambda x, y: x + y, keyPhrases)

Both the numpy and the itertools methods worked fine. I ended up using the itertools method, with the for loop writing each line to a file. It saved me a lot of time and code. Thanks a lot!
for elem in itertools.chain.from_iterable(df['keyPhrases'].values):
    textfile.write(elem + "\n")
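For completeness, the same write loop wrapped in a context manager so the file is closed automatically (the filename and sample data are illustrative):

```python
import itertools
import pandas as pd

df = pd.DataFrame({'keyPhrases': [['word1', 'word2'], ['word4', 'word5']]})

# 'phrases.txt' is a hypothetical output path
with open('phrases.txt', 'w') as textfile:
    for elem in itertools.chain.from_iterable(df['keyPhrases']):
        textfile.write(elem + '\n')
```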

I am not sure of any existing function that does this in a single line of code. The workaround below meets the requirement; if there is a built-in that can do it without the struggle, I would be glad to know.

import pandas as pd

# Existing DF where the data is in the form of lists
df = pd.DataFrame(columns=['ID', 'value_list'])
# New DF where the data should be atomic
df_new = pd.DataFrame(columns=['ID', 'value_single'])

# Sample data
rows = [
    ['A', 'B', 'C', 'D'],
    ['D', 'E', 'F'],
    ['F', 'G'],
    ['H', 'I'],
    ['J'],
]

# Push the sample data into the existing DF
for i, row in enumerate(rows):
    df.loc[i, 'ID'] = i
    df.loc[i, 'value_list'] = row

# Push each list element into the new DF as atomic data
counter = 0
for i in range(len(df)):
    for j in range(len(df['value_list'][i])):
        df_new.loc[counter, 'ID'] = df['ID'][i]
        df_new.loc[counter, 'value_single'] = df['value_list'][i][j]
        counter += 1

print(df_new)

This link could help with your requirement.

Found another way to do it:

df['keyPhrases'] = df['keyPhrases'].str.split(',')  # to make arrays
df['keyPhrases'] = df['keyPhrases'].astype(str)     # back to strings
s = ''.join(df.keyPhrases).replace('[', '').replace(']', '\n').replace(',', '\n')  # replace magic
print(s)
word1
word2
word4
word 5 and 6
word7
word8
etc
etc

The numpy answer above really is very good, but I'll contribute a loop-based version: not performant, but written in the simplest way to understand.

import pandas as pd

lista = [[['word1', 'word2']], [['word4', 'word5', 'word6', 'word7']], [['word8', 'word9', 'word10']]]
df = pd.DataFrame(lista, columns=['keyPhrases'])

result = []  # renamed from 'list' to avoid shadowing the built-in
for key in df.keyPhrases:
    for element in key:
        result.append(element)
result

Related

How to filter a row and remove those values from a column?

I have the following DataFrame:

x = [{'name': 'a.1,b,c,d,e,a,f,g,h'}, {'name': 'b.1,c,a.1,d,e.1,g,a,h'}, {'name': 'b.1,d,e,a,f.1,c,r'}]
df = pd.DataFrame(x)

I need to filter the 'name' column and delete the values a and a.1. I will be grateful for help.
You can use list comprehension:
df['name_1'] = [','.join([a for a in d.split(',') if a not in set(('a','a.1'))]) for d in df['name']]
You can also apply the very same function to the 'name' column:
df['name_1'] = df['name'].apply(lambda row : ','.join([a for a in row.split(',') if a not in set(('a','a.1'))]))
Output:
name name_1
0 a.1,b,c,d,e,a,f,g,h b,c,d,e,f,g,h
1 b.1,c,a.1,d,e.1,g,a,h b.1,c,d,e.1,g,h
2 b.1,d,e,a,f.1,c,r b.1,d,e,f.1,c,r
You can use DataFrame's apply function for this:

def update(r):
    items = r.split(',')
    return ','.join(filter(lambda x: x != 'a.1' and x != 'a', items))

df['name_1'] = df['name'].apply(update)
df
Output:
name name_1
0 a.1,b,c,d,e,a,f,g,h b,c,d,e,f,g,h
1 b.1,c,a.1,d,e.1,g,a,h b.1,c,d,e.1,g,h
2 b.1,d,e,a,f.1,c,r b.1,d,e,f.1,c,r
You can just use replace:
df['name'] = df['name'].replace({'a.1,':'','a':''},regex=True).str.replace(',,',',')
name
0 b,c,d,e,f,g,h
1 b.1,c,d,e.1,g,h
2 b.1,d,e,f.1,c,r
Another regex option (regex=True is needed explicitly in pandas 1.4+, where str.replace defaults to literal matching):

df['name'].str.replace(r"a\.1|a", "", regex=True).str.replace(r'(?<=,),|^,', "", regex=True)
0 b,c,d,e,f,g,h
1 b.1,c,d,e.1,g,h
2 b.1,d,e,f.1,c,r
Try this:
df['name_1'] = df['name'].str.split(',').apply(lambda cell_value: ",".join(filter(lambda x: x not in {'a', 'a.1'}, cell_value)))
I am surprised that all proposed regex answers use two passes to achieve the goal.
Here is a regex answer that relies on a single expression:
# list of substrings to remove
# IMPORTANT: if an element is a left-aligned substring of another
# the substring should be ordered after the longer string
# here "a" is after "a.1"
remove = ['a.1', 'a']
import re
regex = '({0}),|,({0})'.format('|'.join(map(re.escape, remove)))
df['name_1'] = df['name'].str.replace(regex, '', regex=True)
Alternative if group capturing is unwanted:
regex = '(?:{0}),|,(?:{0})'.format('|'.join(map(re.escape, remove)))
Example:
name name_1
0 a.1,b,c,d,e,a,f,g,h b,c,d,e,f,g,h
1 b.1,c,a.1,d,e.1,g,a,h,a b.1,c,d,e.1,g,h
2 b.1,d,e,a,f.1,c,r b.1,d,e,f.1,c,r
Format of the generated regex:
'(a\\.1|a),|,(a\\.1|a)'
# or for the alternative
'(?:a\\.1|a),|,(?:a\\.1|a)'
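A minimal end-to-end run of the single-expression approach above, using the question's data, would look something like this:

```python
import re
import pandas as pd

df = pd.DataFrame({'name': ['a.1,b,c,d,e,a,f,g,h',
                            'b.1,c,a.1,d,e.1,g,a,h',
                            'b.1,d,e,a,f.1,c,r']})

# longer substring ordered before its left-aligned prefix
remove = ['a.1', 'a']
regex = '({0}),|,({0})'.format('|'.join(map(re.escape, remove)))

# one pass: strip the value together with its leading or trailing comma
df['name_1'] = df['name'].str.replace(regex, '', regex=True)
print(df['name_1'].tolist())
# → ['b,c,d,e,f,g,h', 'b.1,c,d,e.1,g,h', 'b.1,d,e,f.1,c,r']
```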

How to remove the matched values in two lists separately in python?

I am facing an issue with my for loop with respect to lists. I have two lists, shown below. I want to remove a name when it appears in both lists. My code:
Input:

col = ['cat','dog','bird','fish']
col_names = ['cat','bird']
r = []
for i in col:
    print(i)
    if i in col_names:
        col_names.remove(i)
    r.append(col_names)
print(r)

Then I am getting output like this:

r = [['dog','fish']] [['dog','fish']]

What I want is:

r = ['dog','bird','fish'] ['cat','dog','fish']
A simpler way to achieve this is a nested list comprehension:
>>> col = ['cat','dog','bird','fish']
>>> col_names= ['cat', 'bird']
>>> [[c for c in col if c !=cn] for cn in col_names]
[['dog', 'bird', 'fish'], ['cat', 'dog', 'fish']]
The code you shared is not logically correct. If you want to do it with an explicit for loop, you can write it like this:

new_list = []
for cn in col_names:
    temp_list = []
    for c in col:
        if c != cn:
            temp_list.append(c)
    new_list.append(temp_list)
print(new_list)
The problem here is that col_names is mutated every time you remove an element, and r ends up holding references to that same list. If you want to achieve your desired output, you should create a copy first, like the following:

col = ['cat','dog','bird','fish']
col_names = ['cat','bird','elephant']
r = []
for name in col_names:
    tmp = col.copy()  # creates a copy of the col list
    try:
        tmp.remove(name)
        r.append(tmp)
    except ValueError:  # name is not in the list
        pass            # don't do anything
print(r)
Output
[['dog', 'bird', 'fish'], ['cat', 'dog', 'fish']]

Remove dictionaries in a list of dictionaries that matches a pattern using list comprehension

I have a list of dictionaries with text as value, and I want to remove the dictionaries that includes certain words in the texts.
df = [{'name':'jon','text':'the day is light'},{'name':'betty','text':'good night'},{'name':'shawn','text':'good afternoon'}]
I want to remove the dictionaries that include words 'light' and 'night' for 'text' key:
words = ['light','night']
pattern = re.compile(r"|".join(words))
Expected result:
df = [{'name':'shawn','text':'good afternoon'}]
[x for x in df if not any(w in x['text'] for w in words)]
You're close. All you need to do is write your list comprehension and apply the search pattern:
result = [x for x in df if not re.search(pattern, x['text'])]
Full example:
import re
df = [{'name':'jon','text':'the day is light'},{'name':'betty','text':'good night'},{'name':'shawn','text':'good afternoon'}]
words = ['light','night']
pattern = re.compile(r"|".join(words))
result = [x for x in df if not re.search(pattern, x['text'])]
print(result) # => [{'name': 'shawn', 'text': 'good afternoon'}]
I found my answer:
[x for x in df if not pattern.search(x['text'])]

Sensing Two Consecutive Strings in String

If I have aaabbbccc, I'd like to change it into a3b3c3.
I am using an if statement for this, but it looks too inefficient.
Maybe regex would be helpful, but is a regex that searches only for the consecutive runs possible?
If I have aaabbbcccaaa, I'd like to change it to a3b3c3a3, which means the algorithm should find only the consecutive runs, count them, and turn the counts into integers.
Any hint on how to proceed would be appreciated.

def comp(string):
    index = []
    for i in range(len(string)):
        try:
            if string[i] != string[i + 1]:  # '!=' rather than 'is not' for comparing characters
                index.append(i)
        except IndexError:
            pass
    first = string[index[0]] + str(index[0] + 1)
    print(first)
    message_comp = [first]
    for i in range(1, len(message_comp)):
        message_comp.append(string[index[i]] * (index[i - 1] + 1))
    final = ''.join(message_comp)
    return final
itertools.groupby:
Make an iterator that returns consecutive keys and groups from the iterable
import itertools
x = 'aaabbbcccaaa'
groups = [i + str(len([*j])) for i, j in itertools.groupby(x)]
# ['a3', 'b3', 'c3', 'a3']
join to finish up:
''.join(groups)
# a3b3c3a3
If needed, replace to remove 1:
''.join(groups).replace('1', '') instead of ''.join(groups)
Maybe itertools groupby?
from itertools import groupby
s = "aaabbbcccaaa"
groups = groupby(s)
a = [(label, sum(1 for _ in group)) for label, group in groups]
b = [i for sub in a for i in sub]
print("".join(map(str,b)))
output: a3b3c3a3
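Since the asker wondered whether a regex can find the runs directly: yes, `(.)\1*` matches a character followed by all of its immediate repeats, and `re.sub` with a replacement function builds the encoded string in one pass (note this keeps `1` counts for single characters; strip them if unwanted):

```python
import re

def rle(s):
    # (.)\1* matches one character plus its consecutive repeats;
    # the replacement emits the character followed by the run length
    return re.sub(r'(.)\1*', lambda m: m.group(1) + str(len(m.group(0))), s)

print(rle('aaabbbcccaaa'))  # → a3b3c3a3
```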

Convert excel table to key list

I want to create a key list for a large hdf file from an Excel file:
Table: (posted as an image; columns A, B, C hold the path components)
I want the key list to look like this:
/blue/a/a1
/blue/a/a2
/blue/b/b1
...
my code so far is:

import pandas as pd
import numpy as np

df = pd.read_excel('file.xlsx', usecols=['A', 'B', 'C'])
print(df)
list1, list2, list3 = df['A'].tolist(), df['B'].tolist(), df['C'].tolist()
print(list1, list2, list3)
for i in list1:
    list1[i].append(list2[i]).append(list3[i])
print(list1)
The conversion to 3 lists works. Then I try to append the rows of each list together, without success. Is there a simple way to do that?
Use zip and str.join to get your required output.
Ex:

res = []
list1, list2, list3 = df['A'].tolist(), df['B'].tolist(), df['C'].tolist()
for i in zip(list1, list2, list3):
    val = map(str, i)
    res.append("/{0}".format("/".join(val)))
You should use an apply function to create the new column as required, then convert it to a list:

import pandas as pd
import numpy as np

df = pd.read_excel('file.xlsx', usecols=['A', 'B', 'C'])
print(df)
x = df[['A', 'B', 'C']].apply(lambda row: '/' + row['A'] + '/' + row['B'] + '/' + row['C'], axis=1)
x.tolist()

Update, shorter code:

x = df[['A', 'B', 'C']].apply(lambda row: '/' + '/'.join(row), axis=1)
x.tolist()
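A fully vectorized variant, assuming the three columns hold strings, is plain concatenation on the columns themselves with no Python-level loop (the sample frame is a stand-in for the Excel data):

```python
import pandas as pd

# illustrative substitute for pd.read_excel('file.xlsx', usecols=['A', 'B', 'C'])
df = pd.DataFrame({'A': ['blue', 'blue'], 'B': ['a', 'a'], 'C': ['a1', 'a2']})

# string concatenation broadcasts across whole columns
keys = ('/' + df['A'] + '/' + df['B'] + '/' + df['C']).tolist()
print(keys)  # → ['/blue/a/a1', '/blue/a/a2']
```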
