Split every row containing long text into multiple rows in pandas - python

I have a DataFrame which has a string column such as below:
id text label
1 this is long string with many words 1
2 this is a middle string 0
3 short string 1
I want to convert this DataFrame into another DataFrame based on the length of the string in words (i.e. something like df['text'].str.split().str.len() > 3), so that every row ends up with at most 3 words:
id text label
1 this is long 1
1 string with many 1
1 words 1
2 this is a 0
2 middle string 0
3 short string 1
This is my code:
pd.concat(df['text'].str.len() > 200)
But it is wrong.

You could split each text into words, rejoin them in chunks of n words, and stack the result:
In [1257]: n = 3
In [1279]: df.set_index(['label', 'id'])['text'].str.split().apply(
      ...:     lambda x: pd.Series([' '.join(x[i:i+n]) for i in range(0, len(x), n)])
      ...: ).stack().reset_index().drop(columns='level_2')
Out[1279]:
label id 0
0 1 1 this is long
1 1 1 string with many
2 1 1 words
3 0 2 this is a
4 0 2 middle string
5 1 3 short string
Details
label text id
0 1 this is long string with many words 1
1 0 this is a middle string 2
2 1 short string 3

IIUC
v=df.text.str.split(' ')
s=pd.DataFrame({'text':v.sum(),'label':df.label.repeat(v.str.len())})
s['New']=s.groupby(s.index).cumcount()
s.groupby([s.New//3,s.index.get_level_values(level=0)]).agg({'text':lambda x : ' '.join(x),'label':'first'}).sort_index(level=1)
Out[1785]:
text label
New
0 0 this is long 1
1 0 string with many 1
2 0 words 1
0 1 this is a 0
1 1 middle string 0
0 2 short string 1

This is one solution, using a for loop and a while loop to split your text into sets of 3 words:
array = []
for ii, row in df.iterrows():
    words = row['text'].split()
    if len(words) > 3:
        jj = 0
        while jj < len(words):
            # one new row per chunk of up to 3 words
            array.append(
                pd.Series(
                    {'id': row['id'], 'label': row['label'],
                     'text': ' '.join(words[jj:jj+3])}
                )
            )
            jj += 3
    else:
        array.append(row)
result = pd.DataFrame(array).reset_index(drop=True)
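Another way to get the same result is to build the list of chunks per row and let pandas expand it. This is only a sketch, assuming pandas 0.25 or later (for DataFrame.explode); the helper name split_rows_by_words and its parameters are made up for illustration.

```python
import pandas as pd


def split_rows_by_words(df, column="text", n=3):
    """Return a copy of df in which `column` holds at most n words per row."""
    def chunk(text):
        words = text.split()
        # Keep one (empty) chunk for empty strings so the row is not lost.
        return [" ".join(words[i:i + n]) for i in range(0, len(words), n)] or [""]

    out = df.copy()
    out[column] = out[column].map(chunk)
    # explode() repeats the other columns once per chunk.
    return out.explode(column).reset_index(drop=True)


df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["this is long string with many words",
             "this is a middle string",
             "short string"],
    "label": [1, 0, 1],
})
result = split_rows_by_words(df, n=3)
print(result)
```

The id and label columns are carried along unchanged, so no index juggling is needed afterwards.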

Related

Removing the first and only the first '-' in the values of a string column

1 2
0 ADRC-111-01 ADRC111
1 ADRC-11955-01 ADRC11955
2 ADRC-18133-01 ADRC18133
3 SWAN0023-03 SWAN0023
In Column 1, I wish to get rid of the first - sign, regardless of how many there are in the cell. There are one or two - in each entry.
Desired output:
1 2
0 ADRC111-01 ADRC111
1 ADRC11955-01 ADRC11955
2 ADRC18133-01 ADRC18133
3 SWAN002303 SWAN0023
Use .str.replace with n=1:
df['1'] = df['1'].str.replace('-', '', n=1)
Output:
>>> df
1 2
0 ADRC111-01 ADRC111
1 ADRC11955-01 ADRC11955
2 ADRC18133-01 ADRC18133
3 SWAN002303 SWAN0023
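For anyone who wants to reproduce this, here is a self-contained version of the same call. The DataFrame is rebuilt from the question's sample, and regex=False is my addition to make it explicit that '-' is treated as a literal character.

```python
import pandas as pd

df = pd.DataFrame({
    "1": ["ADRC-111-01", "ADRC-11955-01", "ADRC-18133-01", "SWAN0023-03"],
    "2": ["ADRC111", "ADRC11955", "ADRC18133", "SWAN0023"],
})

# n=1 limits the replacement to the first match in each string.
df["1"] = df["1"].str.replace("-", "", n=1, regex=False)
print(df)
```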

Conditional sum of non zero values

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the running sum of consecutive non-zero values of column Fn.
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
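The two steps can be wrapped in a small helper and checked against the Fn column from the question. This is a minimal sketch; the function name running_nonzero_count is my own and not part of pandas.

```python
import pandas as pd


def running_nonzero_count(s):
    """Count consecutive non-zero values, restarting at 0 after every zero."""
    # Every zero starts a new pseudo-group.
    groups = s.eq(0).cumsum()
    # Within each group, count the non-zero entries seen so far.
    return s.ne(0).astype(int).groupby(groups).cumsum()


fn = pd.Series([0, 1, 0, 1, 1, 0, 0, 1, 1, 1], name="Fn")
counts = running_nonzero_count(fn)
print(counts.tolist())
```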
How about using cumsum and resetting it whenever the value is 0:
nonzero = df['Fn'].ne(0)
running = nonzero.cumsum()
# subtract the running total recorded at the most recent zero (0 if there was none yet)
df['Fn2'] = running - running.where(~nonzero).ffill().fillna(0).astype(int)
df
You can store the Fn column in a list and build a new list by iterating over it: if the current value is greater than zero, add it to the previous running total, otherwise reset the total to zero. After that, turn the list into a DataFrame and concatenate it column-wise with the existing DataFrame:
fn = df['Fn'].tolist()
sum_list = [fn[0]]
for i in range(1, len(fn)):
    if fn[i] > 0:
        sum_list.append(sum_list[-1] + fn[i])
    else:
        sum_list.append(0)
dfsum = pd.DataFrame({'Sum': sum_list}, index=df.index)
df = pd.concat([df, dfsum], axis=1)
Hope this helps. The idea is to carry a running total forward and reset it at every zero.
Try this:
sum_arr = [0]
for val in df['Fn']:
    if val > 0:
        sum_arr.append(sum_arr[-1] + 1)
    else:
        sum_arr.append(0)
df['sum'] = sum_arr[1:]
df

How to merge strings that have certain number of substrings in common to produce some groups in a data frame in Python

I asked a similar but simpler question before, which has been resolved: how to merge strings that have substrings in common to produce some groups in a data frame in Python.
Here I have a more advanced version of that question.
I have a sample data:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge some strings if they have substrings in common. In this example, the strings 'b,c', 'a' and 'a,c,d,e' should be merged together because they can be linked to each other, and 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope to get something like:
group
'b,c', 0
'a', 0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l', 2
'k,l,m' 2
So I get three groups, and no two groups have any substrings in common.
Now I am trying to build a similarity data frame, in which 1 means that two strings have substrings in common. Here is my code:
commonWords=1
for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0
for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
In this code, commonWords is the number of substrings two strings must have in common. For example, if commonWords=2, two strings are merged only if they share two or more substrings. When commonWords=2, the groups should be:
group
'b,c', 0
'a', 1
'a,c,d,e', 2
'f,g,h,i', 3
'j,k,l', 4
'k,l,m' 4
Use:
import numpy as np
import pandas as pd
from itertools import combinations, chain
from collections import Counter
a = pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
# split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords = 2
# create edges (can only connect two nodes)
L2_nested = [list(combinations(l, commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
# convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
# create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print (a)
# replace ACTIVITY by the row index, forward fill it into the new columns and use factorize for groups
new = pd.factorize(a.assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print (a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4
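If the groups are meant to be transitive (a row joins a group as soon as it shares enough substrings with any member of it), the problem can also be framed as finding connected components. The following is only a sketch under that assumption: group_by_shared_tokens and its parameters are names made up for illustration, and it compares every pair of rows, so it is quadratic in the number of rows.

```python
from itertools import combinations

import pandas as pd


def group_by_shared_tokens(strings, common_words=1, sep=","):
    """Label strings so that any two sharing >= common_words tokens get one label."""
    token_sets = [set(s.split(sep)) for s in strings]
    parent = list(range(len(token_sets)))  # union-find forest

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(token_sets)), 2):
        if len(token_sets[i] & token_sets[j]) >= common_words:
            parent[find(j)] = find(i)  # merge the two components

    # Number the components in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(token_sets))]


a = pd.DataFrame({"ACTIVITY": ["b,c", "a", "a,c,d,e", "f,g,h,i", "j,k,l", "k,l,m"]})
groups_1 = group_by_shared_tokens(a["ACTIVITY"], common_words=1)
groups_2 = group_by_shared_tokens(a["ACTIVITY"], common_words=2)
a["group"] = groups_2
print(a)
```

With common_words=1 this gives the three groups from the first example in the question, and with common_words=2 only 'j,k,l' and 'k,l,m' end up together.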

Get the sum of rows that contain 0 as a value

I would like to know how to solve the following problem in Python.
I have a dataframe that contains this column:
Column X
1
0
0
0
1
1
0
0
1
I want to create a list b in which each 0 is replaced by the length of the run of successive 0 values it belongs to, to get something like this:
List X
1
3
3
3
1
1
2
2
1
If I understand your question correctly, you want to replace all the zeros with the number of consecutive zeros in the current streak, but leave non-zero numbers untouched. So
1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0
becomes
1 4 4 4 4 1 1 1 1 2 2 1 1 1 5 5 5 5 5
To do that, this should work, assuming your input column (a pandas Series) is called x.
result = []
i = 0
while i < len(x):
    if x[i] != 0:
        result.append(x[i])
        i += 1
    else:
        # See how many times zero occurs in a row
        j = i
        n_zeros = 0
        while j < len(x) and x[j] == 0:
            n_zeros += 1
            j += 1
        result.extend([n_zeros] * n_zeros)
        i += n_zeros
result
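The same idea can be written without an explicit loop by labelling each run and looking up its size. This is a sketch rather than a drop-in replacement; zero_streak_lengths is a hypothetical helper name, and it assumes the input is a numeric pandas Series.

```python
import pandas as pd


def zero_streak_lengths(x):
    """Replace each 0 with the length of its run of zeros; keep other values."""
    is_zero = x.eq(0)
    # A new run starts wherever the zero / non-zero state changes.
    run_id = is_zero.astype(int).diff().ne(0).cumsum()
    # Size of the run each element belongs to.
    run_len = is_zero.groupby(run_id).transform("size")
    # Keep non-zero values as they are, use the run length for the zeros.
    return x.where(~is_zero, run_len)


x = pd.Series([1, 0, 0, 0, 1, 1, 0, 0, 1])
streaks = zero_streak_lengths(x)
print(streaks.tolist())
```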

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to flag the first 2 positions of each consecutive run (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python's itertools.groupby"
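To make the limit a parameter instead of the hard-coded <= 1, the expression can be wrapped in a function. This is a minimal sketch, assuming the column only contains 0 and 1; cap_runs_of_ones is a name chosen for illustration.

```python
import pandas as pd


def cap_runs_of_ones(a, limit):
    """Keep only the first `limit` ones of every consecutive run of ones."""
    # Label consecutive runs of equal values.
    run_id = a.ne(a.shift()).cumsum()
    # Zero-based position of each element inside its run.
    position = a.groupby(run_id).cumcount()
    return ((position < limit) & a.eq(1)).astype(int)


df = pd.DataFrame({"A": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df["B"] = cap_runs_of_ones(df["A"], limit=2)
print(df)
```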
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = [df['A'][x] if x < limit else int(df['A'][x] == 1 and not all(y == 1 for y in df['A'][x - limit:x])) for x in range(len(df))]
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a pandas object, it can just be a normal NumPy array)
a = df['A'].to_numpy(copy=True)
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.array(array)  # work on a copy so the input is left untouched
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
