Extracting multiple words from pandas dataframe column into same column - python

Suppose a dataframe consists of two columns, A = {1, 2, 3} and B = {'a b c d', 'e f g h', 'i j k l'}. For A = 2, I would like to change the corresponding entry in column B to 'e f h' (i.e. extract the first, second and last words; for longer strings this is not the same as dropping the third word).
It is easy to extract single words using df.loc[df['colA']==2,'colB'].str.split().str[x], where x = 0, 1 and -1, but I'm having difficulty joining the three words back into one string efficiently. The most efficient way I can think of is provided below. Is there a better way of achieving what I'm trying to do? Thanks.
y = lambda x : df.loc[df['colA']==2,'colB'].str.split().str[x]
df.loc[df['colA']==2,'colB'] = y(0) + ' ' + y(1) + ' ' + y(-1)
Expected and actual result:
A B
1 a b c d
2 e f h
3 i j k l

You were pretty close to the solution; the only problem is that str[x] returns a value wrapped in a Series object. You can fix this by extracting the value from the Series as shown:
y = lambda x : df.loc[df['colA']==2,'colB'].str.split().str[x].values[0]
df.loc[df['colA']==2,'colB'] = y(0) + ' ' + y(1) + ' ' + y(-1)
You can also achieve the same by making use of the apply function:
df.loc[df['colA']==2, 'colB'] = df.loc[df['colA']==2,'colB'].apply(lambda x: ' '.join(x.split()[0:2] + [x.split()[-1]]))
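If you'd rather avoid splitting at all, a regex replacement can drop everything between the second and last words in a single vectorized call. This is a sketch assuming each target string has at least four words:

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3],
                   'colB': ['a b c d', 'e f g h', 'i j k l']})

mask = df['colA'] == 2
# keep the first two words and the last word; drop whatever sits between
df.loc[mask, 'colB'] = df.loc[mask, 'colB'].str.replace(
    r'^(\S+\s+\S+)\s.*\s(\S+)$', r'\1 \2', regex=True)
```

Rows that don't match the mask (or strings with fewer than four words) are left untouched.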

How about this:
df = pd.DataFrame(data={'A': [1, 2, 3],
                        'B': ['a b c d', 'e f g h', 'i j k l']})
y = lambda x: df.loc[df['A']==2,'B'].str[0:2*x+2] + df.loc[df['A']==2,'B'].str[-1]
df.loc[df['A']==2,'B'] = y(1)
Then df is the wanted:
A B
0 1 a b c d
1 2 e f h
2 3 i j k l

Related

How to discard multiple elements from a set?

I am trying to discard elements with length less than 10, but it doesn't work.
a = {'ab', 'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
a.discard(x for x in a if len(x.split())<9) # discard elements with length<10
print(a)
I got this output:
{'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p', 'ab'}
'ab' doesn't match the condition, I don't know why it's still here?
And my desired output is:
{'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
You need to call discard on individual items, not on a generator of items to be discarded:
for x in [x for x in a if len(x.split()) < 9]:
    a.discard(x)
Be mindful that you can't discard items while iterating through the set, so this will not work:
for x in a:
    if len(x.split()) < 9:
        a.discard(x)
Although this is beyond your question, I'd like to add that there are better ways to do what you want through set comprehension or set subtraction as suggested in another answer and comments.
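The set-subtraction variant mentioned above looks like this: collect the offending items first, then remove them all in one step:

```python
a = {'ab', 'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
# build the set of items that fail the condition, then subtract it in bulk
a -= {x for x in a if len(x.split()) < 9}
```

Because the comprehension builds a new set before the subtraction happens, there is no modify-while-iterating problem.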
You are using discard incorrectly: it removes a single element (if present), so passing it a generator only tries to discard the generator object itself, which is not in the set. To remove elements based on a condition, rebuild the set instead. Here's a corrected version of the code:
a = {'ab', 'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}
a = {x for x in a if len(x.split()) >= 9}
print(a)
This code creates a new set with only the elements that meet the condition, and then assigns it back to a. The desired output is achieved:
{'z x c v b n m k l j h g f f d s a a', 'q w e r t y u i o p'}

Is it possible to split a column value and add a new column at the same time for dataframe?

I have a dataframe with some columns delimited with '|', and I need to flatten this dataframe. Example:
name type
a l
b m
c|d|e n
For this df, I want to flatten it to:
name type
a l
b m
c n
d n
e n
To do this, I used this command:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
Now, I want do one more thing besides above flatten operation:
name type co_occur
a l
b m
c n d
c n e
d n e
That is, not only split the 'c|d|e' into two rows, but also create a new column which contains a 'co_occur' relationship, in which 'c' and 'd' and 'e' co-occur with each other.
I don't see an easy way to do this by modifying:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
I think this is what you want. Use combinations and piece everything together:
from itertools import combinations
import io
data = '''name type
a l
b m
c|d|e n
j|k o
f|g|h|i p
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python')
# hold the new dataframes as you iterate via apply()
df_hold = []
def explode_combos(x):
    combos = list(combinations(x['name'].split('|'), 2))
    df_hold.append(pd.DataFrame([{'name': c[0], 'type': x['type'], 'co_cur': c[1]}
                                 for c in combos]))
    return
# only apply() to those rows that need to be exploded
dft = df[df['name'].str.contains(r'\|')].apply(explode_combos, axis=1)
# concatenate the result
dfn = pd.concat(df_hold)
# add back to rows that weren't operated on (see the ~)
df_final = pd.concat([df[~df['name'].str.contains(r'\|')], dfn]).fillna('')
name type co_cur
0 a l
1 b m
0 c n d
1 c n e
2 d n e
0 j o k
0 f p g
1 f p h
2 f p i
3 g p h
4 g p i
5 h p i
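The same result can also be built with a plain loop over the rows, which avoids the global df_hold list and the side-effecting apply. A sketch; the empty-string fill for non-exploded rows mirrors the fillna('') step above:

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c|d|e'],
                   'type': ['l', 'm', 'n']})

records = []
for _, row in df.iterrows():
    names = row['name'].split('|')
    if len(names) == 1:
        # rows without '|' pass through with an empty co_occur
        records.append({'name': names[0], 'type': row['type'], 'co_occur': ''})
    else:
        # one record per unordered pair of co-occurring names
        for first, second in combinations(names, 2):
            records.append({'name': first, 'type': row['type'], 'co_occur': second})
out = pd.DataFrame(records)
```

iterrows is slow on large frames, but for moderate data this keeps the logic in one place.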

How to split a dataframe string column into multiple columns?

I have a pandas dataframe. This dataframe consists of a single column. I want to parse this column according to the '&' sign and add the data to the right of the "=" sign as a new column. Examples are below.
The dataframe I have;
tags
0 letter1=A&letter2=B&letter3=C
1 letter1=D&letter2=E&letter3=F
2 letter1=G&letter2=H&letter3=I
3 letter1=J&letter2=K&letter3=L
4 letter1=M&letter2=N&letter3=O
5 letter1=P&letter2=R&letter3=S
. .
. .
dataframe that I want to convert;
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
.
.
I tried to do something with this code snippet.
columnname= df["tags"][0].split("&")[i].split("=")[0]
value =df["tags"][0].split("&")[i].split("=")[1]
But I'm not sure how I can do it for the whole dataframe. I am looking for a faster and stable way.
Thanks in advance,
Do this:
import pandas as pd
tags = [
    "letter1=A&letter2=B&letter3=C",
    "letter1=D&letter2=E&letter3=F",
    "letter1=G&letter2=H&letter3=I",
    "letter1=J&letter2=K&letter3=L",
    "letter1=M&letter2=N&letter3=O",
    "letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
df["letter1"] = df["tags"].apply(lambda x: x.split("&")[0].split("=")[-1])
df["letter2"] = df["tags"].apply(lambda x: x.split("&")[1].split("=")[-1])
df["letter3"] = df["tags"].apply(lambda x: x.split("&")[2].split("=")[-1])
df = df[["letter1", "letter2", "letter3"]]
df
Split into separate columns via str.split, using '&' as the separator:
step1 = df.tags.str.split("&", expand=True)
Get the new columns from the first row of step1:
new_columns = step1.loc[0, :].str[:-2].array
Get rid of the letter1= prefix in each column, set the new_columns as the header:
step1.set_axis(new_columns, axis='columns').transform(lambda col: col.str[-1])
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
5 P R S
d=list(df["tags"])
r={}
for i in d:
for ele in i.split("&"):
if ele.split("=")[0] in r.keys():
r[ele.split("=")[0]].append(ele.split("=")[1])
else:
r[ele.split("=")[0]]=[]
r[ele.split("=")[0]].append(ele.split("=")[1])
df = pd.DataFrame({i:pd.Series(r[i]) for i in r})
print (df)
Using regex
import pandas as pd
import re
tags = [
    "letter1=A&letter2=B&letter3=C",
    "letter1=D&letter2=E&letter3=F",
    "letter1=G&letter2=H&letter3=I",
    "letter1=J&letter2=K&letter3=L",
    "letter1=M&letter2=N&letter3=O",
    "letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
pattern = re.compile(r"=(\w+)")  # capture the value after each '='
df['letter1'], df['letter2'], df['letter3'] = zip(*df["tags"].apply(lambda x: pattern.findall(x)))
Output
tags letter1 letter2 letter3
0 letter1=A&letter2=B&letter3=C A B C
1 letter1=D&letter2=E&letter3=F D E F
2 letter1=G&letter2=H&letter3=I G H I
3 letter1=J&letter2=K&letter3=L J K L
4 letter1=M&letter2=N&letter3=O M N O
5 letter1=P&letter2=R&letter3=S P R S
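Since each tags entry is formatted exactly like a URL query string, the standard library's query-string parser can also do the splitting; a sketch:

```python
from urllib.parse import parse_qsl
import pandas as pd

tags = ["letter1=A&letter2=B&letter3=C",
        "letter1=D&letter2=E&letter3=F"]
df = pd.DataFrame({"tags": tags})

# parse_qsl turns 'k1=v1&k2=v2' into [('k1', 'v1'), ('k2', 'v2')]
parsed = df["tags"].apply(lambda x: dict(parse_qsl(x)))
result = pd.DataFrame(parsed.tolist())
```

Because columns are matched by key name, this also copes with rows whose keys appear in a different order.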

Nested loop results in table, Python

I need to loop a computation over two lists of elements and save the results in a table. So, say that
months = [1,2,3,4,5]
Region = ['Region1', 'Region2']
and that my code is of the type
df = []
for month in months:
    for region in Region:
        # code
        x = result
        df.append(x)
What I cannot achieve is rendering the final result in a table in which the rows are regions and the columns are months:
1 2 3 4 5
Region1 a b c d e
Region2 f g h i j
Assuming that there is the right number of items in result:
result = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame(
    [[Region[i]] + result[i*len(months):(i+1)*len(months)] for i in range(len(Region))],
    columns=["Region"] + months
).set_index("Region")
Output
1 2 3 4 5
Region
Region1 a b c d e
Region2 f g h i j
This part
[[Region[i]] + result[i*len(months): ((i+1)*len(months))] for i in range(len(Region))]
is equivalent to something like this
res = []
for i in range(len(Region)):
row = [Region[i]] + result[i*len(months): ((i+1)*len(months))]
res.append(row)
where I use the length of months to slice result into equal parts, one per row, and add the name of the region at the beginning of each row.
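Assuming result is laid out region-by-region, as in the slicing above, numpy's reshape gets you the same table more directly; a sketch:

```python
import numpy as np
import pandas as pd

months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
result = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# one row per region, one column per month
df = pd.DataFrame(np.array(result).reshape(len(Region), len(months)),
                  index=Region, columns=months)
```

If the computation loop instead iterates months in the outer loop, reshape to (len(months), len(Region)) and transpose.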
Another solution - more lines of code:
import pandas as pd
import ast
months = [1,2,3,4,5]
Regions = ['Region1', 'Region2']
df = pd.DataFrame()
for region in Regions:
    row = '{\'Region\': \'' + region + '\', '
    for month in months:
        # put your calculation code
        x = month + 1
        row = row + '\'' + str(month) + '\':[' + str(x) + '],'
    row = row[:len(row)-1] + '}'
    row = ast.literal_eval(row)
    df = pd.concat([df, pd.DataFrame(row)])
df
This might work, depending on what and how you want the results.
import pandas as pd
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame(columns=[1, 2, 3]) # just to put in something
value = 59
for r in Region:
    value += 5
    for m in months:
        df.loc[r, m] = chr(m + value)

Convert a large string to Dataframe

I've a large string looking like this :
'1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n ....'
The issue is that I can't split the string on a fixed number of spaces: between 'Start Date' and 'str_date' there are two spaces, but the next line uses three, and another might use only one. This makes it very hard to build the DataFrame I want. Is there a way to do this? Thanks.
To get a list of all the words that contain '_' (as you requested in the comments) you can use a regular expression:
import re
s = '1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n ....'
list(map(re.Match.group, re.finditer(r'\w+_.\w+', s)))
output:
['str_date', 'cal_nt', 'cal_Rate_td']
or you can use a list comprehension:
[e for e in s.split() if '_' in e]
output:
['str_date', 'cal_nt', 'cal_Rate_td']
To build a data frame from your string you can use the above information: the third field (the one containing '_') serves as an anchor for splitting each line:
s = '1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n'
third_fields = [e for e in s.split() if '_' in e]
rows = []
for third_field, row in zip(third_fields, s.split('\n')):
    current_row = []
    row = row.strip()
    first_field = re.search(r'\d+\b', row).group()
    current_row.append(first_field)
    # remove first field
    row = row[len(first_field):].strip()
    second_field, rest_of_fields = row.split(third_field)
    parsed_fields = [e.group() for e in re.finditer(r'\b[\w\d]+\b', rest_of_fields)]
    current_row.extend([second_field, third_field, *parsed_fields])
    rows.append(current_row)
pd.DataFrame(rows)
Like @kederrac's answer, you can use a regex to split the fields:
import re
s = "1 Start Date str_date B 10 C "
l = re.compile(r"\s+").split(s.strip())
# output ['1', 'Start', 'Date', 'str_date', 'B', '10', 'C']
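Putting that together for the whole string: each line has a variable-length description, but the token containing '_' marks where the description ends, so the middle tokens can be re-joined. A sketch; the column names (num, description, code, c1-c3) are made up here, since the source string doesn't name its fields:

```python
import re
import pandas as pd

s = ('1 Start Date  str_date B 10 C \n'
     ' 2 Calculation notional   cal_nt C 10 0\n'
     ' 3 Calculation RATE Today cal_Rate_td C 9 R\n')

rows = []
for line in s.strip().split('\n'):
    tokens = re.split(r'\s+', line.strip())
    # the token containing '_' ends the multi-word description
    idx = next(i for i, t in enumerate(tokens) if '_' in t)
    rows.append([tokens[0], ' '.join(tokens[1:idx])] + tokens[idx:])

# hypothetical column names: the real headers aren't given in the question
df = pd.DataFrame(rows, columns=['num', 'description', 'code', 'c1', 'c2', 'c3'])
```

This sidesteps the inconsistent spacing entirely, because re.split collapses any run of whitespace.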
