I have two lists: one holds some lines of text, and the other holds a value for each of those lines, as follows:
text = ['Hello, ','I need some help in here ','things are not working well ','so i posted ','this question here ','hoping to get some ','good ','answers ','out of you ','that\'s it ','thanks']
value = [1,1,0,1,1,1,0,1,1,0,1]
The goal is to concatenate consecutive lines whose value is 1, to get this result in any way possible:
['Hello, I need some help in here ',
'so i posted this question here hoping to get some',
'answers out of you ',
'thanks']
I tried to put it in a DataFrame but then didn't know how to go on (using pandas in the solution is not a must):
print(pd.DataFrame(data={"text":text,"value":value}))
text value
0 Hello, 1
1 I need some help in here 1
2 things are not working well 0
3 so i posted 1
4 this question here 1
5 hoping to get some 1
6 good 0
7 answers 1
8 out of you 1
9 that's it 0
10 thanks 1
Waiting for some answers.
There is no need to use Pandas:
tmp_str = ""
results = []
for chuck, is_evolved in zip(text, value):
if is_evolved:
tmp_str += chuck
else:
results.append(tmp_str)
tmp_str = ""
if tmp_str:
results.append(tmp_str)
print(results)
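With the question's text and value lists, this prints (the trailing spaces come from the original strings):
['Hello, I need some help in here ',
 'so i posted this question here hoping to get some ',
 'answers out of you ',
 'thanks']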
If you want a pandas approach, you can use pandas.Series.cumsum to build group keys, pandas.DataFrame.groupby with df.groupby.transform to aggregate via str.join, and then select the indices where value is 1:
>>> df.groupby(
df['value'].ne(df['value'].shift(1)
).cumsum()
).transform(' '.join)[df['value'].eq(1)].drop_duplicates()
text
0 Hello, I need some help in here
3 so i posted this question here hoping to get some
7 answers out of you
10 thanks
EXPLANATION
>>> df['value'].ne(df['value'].shift(1)).cumsum()
0 1
1 1
2 2
3 3
4 3
5 3
6 4
7 5
8 5
9 6
10 7
Name: value, dtype: int32
>>> df.groupby(df['value'].ne(df['value'].shift(1)).cumsum()).transform(' '.join)
text
0 Hello, I need some help in here
1 Hello, I need some help in here
2 things are not working well
3 so i posted this question here hoping to get some
4 so i posted this question here hoping to get some
5 so i posted this question here hoping to get some
6 good
7 answers out of you
8 answers out of you
9 that's it
10 thanks
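A sketch of an equivalent pipeline (assuming df is built as above) that joins only the text column, avoiding transform over the whole frame; the key is the same shift/cumsum trick:
>>> key = df['value'].ne(df['value'].shift()).cumsum()
>>> df.loc[df['value'].eq(1), 'text'].groupby(key).agg(' '.join).tolist()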
If you don't need a dataframe, you can use itertools.groupby over the zipped (text, value) pairs, grouping by the second element, i.e. value, and then ' '.join each group's text parts where key == 1.
>>> from itertools import groupby
>>> [' '.join([*zip(*g)][0]) for k, g in groupby(zip(text, value), lambda x: x[1]) if k]
['Hello, I need some help in here ',
'so i posted this question here hoping to get some ',
'answers out of you ',
'thanks']
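Equivalently, the [*zip(*g)][0] idiom can be replaced by a plain generator over the pairs; a minimal sketch of the same grouping:
>>> [' '.join(t for t, _ in g) for k, g in groupby(zip(text, value), lambda x: x[1]) if k]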
Solution
A Pythonic approach without using pandas:
text_ = [text[count] for count, n in enumerate(value) if n == 1]
Description
This takes the item from text at each index where the corresponding item in value equals 1. Note that it only filters the lines; unlike the expected result, it does not merge consecutive runs into single strings.
Output
['Hello, ', 'I need some help in here ', 'so i posted ', 'this question here ', 'hoping to get some ', 'answers ', 'out of you ', 'thanks']
Related
How can I pass the grouping index value as an additional argument alongside the group's sub-dataframe?
This crude example just applies a univariate function:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randint(0, 10, size=(3, 3)), index=['a', 'b', 'a'])
t = df.groupby(df.index).apply(lambda x: ''.join(str(x)))
0 1 2
a 8 6 7
b 6 2 4
a 8 2 4
This function accepts as an argument the index upon which the dataframe was grouped:
def f(g, indx):
    return ''.join(str(g)) + '___' + str(indx)
The output should be:
0
a '8 6 7 8 2 4___a'
b '6 2 4___b'
I understand that this example is trivial, but the point is to pass the grouping index value as an argument alongside the grouped sub-dataframe. The solution I see is to iterate over the groupby object, but I am not sure that is a good solution performance-wise.
Mathematica has a MapIndexed function that does this job, though without prior grouping. It seems this question has been asked before.
You can get at the group's index value via .name. So you can do something like:
df.groupby(df.index).apply(lambda x: ''.join(str(x.values)) + '___' + str(x.name))
The output is not exactly what you want, but I figured I'd get this info to you quickly; I assume you can clean it up to what you want.
Output (older version of your data):
a [[8 4 6]\n [6 8 9]]___a
b [[1 3 2]]___b
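If you specifically want to call a two-argument function like the f from the question, a minimal sketch forwards the group's .name from inside the lambda:
df.groupby(df.index).apply(lambda g: f(g, g.name))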
I have a dataframe with a column whose rows each contain a list of words. I'd like to remove what I guess are empty strings or spaces. I managed to get rid of some by doing:
for i in processed.text:
    for x in i:
        if x == '' or x == " ":
            i.remove(x)
But some of them still remain.
>processed['text']
0 [have, month, #postdoc, within, on, chemical, ...
1 [hardworking, producers, iowa, so, for, state,...
2 [hardworking, producers, iowa, so, for, state,...
3 [today, time, is, to, sources, energy, much, p...
4 [thanks, gaetanos, club, c, oh, choosing, #rec...
...
130736 [gw, fossil, renewable, import, , , , , , , , ...
130737 [s, not, , go, ]
130738 [answer, deforestation, in, ]
130739 [plastic, regrind, any, and, grades, we, make,...
130740 [grid, generating, of, , , , gw]
Name: text, Length: 130741, dtype: object
>type(processed)
<class 'pandas.core.frame.DataFrame'>
Thank you very much.
Split on commas, remove the empty values, and then join again with commas:
def remove_empty(x):
    if type(x) is str:
        x = x.split(",")
        x = [y for y in x if y.strip()]
        return ",".join(x)
    elif type(x) is list:
        return [y for y in x if y.strip()]

processed['text'] = processed['text'].apply(remove_empty)
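Incidentally, the likely reason some empties survived the original loop is that list.remove mutates the list while it is being iterated, which skips the element right after each removal. A minimal sketch applying the same filter directly to the question's column (assuming each cell is a list of strings, as shown):
processed['text'] = processed['text'].apply(lambda words: [w for w in words if w.strip()])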
You can use split(expand=True) to do that. Note: you don't have to spell out split(' ', expand=True); by default it splits on whitespace. You can also replace ' ' with anything else. For example, if your words are separated by , or -, you can pass that separator to split the columns.
import pandas as pd
df = pd.DataFrame({'Col1': ['This is a long sentence',
                            'This is another long sentence',
                            'This is short',
                            'This is medium length',
                            'Wow. Tiny',
                            'Petite',
                            'Ok']})
print(df)
df = df.Col1.str.split(' ', expand=True)
print(df)
The output of this will be:
Original dataframe:
Col1
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok
Dataframe split into columns
0 1 2 3 4
0 This is a long sentence
1 This is another long sentence
2 This is short None None
3 This is medium length
4 Wow. Tiny None None None
5 Petite None None None None
6 Ok None None None None
If you want to limit them to 3 columns only, then use n=2
df = df.Col1.str.split(' ',n = 2, expand=True)
The output will be:
0 1 2
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
If you want to rename the columns to be more specific, then you can add rename to the end like this.
df = df.Col1.str.split(' ',n = 2, expand=True).rename({0:'A',1:'B',2:'C'},axis=1)
A B C
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny None
5 Petite None None
6 Ok None None
In case you want to replace all the None values with '' and also prefix the column names, you can do it as follows:
df = df.Col1.str.split(expand=True).add_prefix('Col').fillna('')
Col0 Col1 Col2 Col3 Col4
0 This is a long sentence
1 This is another long sentence
2 This is short
3 This is medium length
4 Wow. Tiny
5 Petite
6 Ok
I have data that contains the string 'None ...' in random places. I am trying to replace a cell in the dataframe with an empty string only when it begins with 'None ...'. Here is what I tried, but I get errors like KeyError.
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'sub': ['None ... ', 'None ... test', 'math None ...', 'probability', 'chemistry']})
df.loc[df['sub'].str.replace('None ...','',1), 'sub'] = '' # getting key error
Output I am looking for (I need to replace the entire cell value if 'None ...' is the starting string; notice the 3rd row shouldn't be replaced because 'None ...' is not at the start):
id sub
1
2
3 math None ...
4 probability
5 chemistry
You can use the below to identify the cells to replace and then assign them an empty value:
df.loc[df['sub'].str.startswith("None"), 'sub'] = ""
df.head()
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
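One caveat with matching on "None" alone: it would also blank a legitimate subject that merely starts with that word. If that matters, a small variant matching the fuller prefix:
df.loc[df['sub'].str.startswith('None ...'), 'sub'] = ''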
You can simply replace 'None ...', and by using a regular expression you can apply the replacement only to strings that start with None:
df['sub'] = df['sub'].str.replace(r'^None \.\.\.*', '', n=1, regex=True)
the output looks like this:
id sub
0 1
1 2 test
2 3 math None ...
3 4 probability
4 5 chemistry
df['sub'] = df['sub'].str.replace(r'[\w\s]*?(None \.\.\.)[\s\w]*?', '', n=1, regex=True)
Out:
sub
id
1
2 test
3
4 probability
5 chemistry
Look at startswith; after we find the rows that need to be replaced, we use mask to replace them:
df['sub'] = df['sub'].mask(df['sub'].str.startswith('None ... '), '')
df
Out[338]:
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
First, you are passing the replaced 'sub' strings to .loc as index labels; that is why you received a KeyError.
Second, you can do this by:
df['sub'] = df['sub'].apply(lambda x: '' if x.find('None') == 0 else x)
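An equivalent, arguably more readable, form of the same apply, using str.startswith:
df['sub'] = df['sub'].apply(lambda x: '' if x.startswith('None') else x)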
I have an input string which looks like this:
ms = 'hello stack overflow friends'
And a pandas dataframe with the following structure:
string priority value
0 hi 1 2
1 astronaut 10 3
2 overflow 3 -1
3 varfoo 4 1
4 hello 2 0
Then I'm trying to do the following simple algorithm:
Sort the pandas dataframe ascending by the df['priority'] column.
Check if the ms string variable contains the df['string'] word.
If so, return its df['value'].
Therefore, this is my approach to do so:
import pandas as pd
ms = 'hello stack overflow friends'
df = pd.DataFrame({'string': ['hi', 'astronaut', 'overflow', 'varfoo', 'hello'],
                   'priority': [1, 10, 3, 4, 2],
                   'value': [2, 3, -1, 1, 0]})
final_val = None
for _, row in df.sort_values('priority').iterrows():
    # just printing the current row for debug purposes
    print(row['string'], row['priority'], row['value'])
    if ms.find(row['string']) > -1:
        final_val = row['value']
        break
print()
print("The final value for '", ms, "' is ", final_val)
Which returns the following:
hi 1 2
hello 2 0
The final value for ' hello stack overflow friends ' is 0
This code works OK, but my df has around 20K rows, and I need to perform this kind of search more than 1K times. That dramatically slows down my process. Is there a better (or simpler) approach using pure pandas that avoids unnecessary loops?
Write a function that you can apply to your dataframe rather than using iterrows:
match_set = set(ms.split())

def check_matches(row):
    return row['value'] if row['string'] in match_set else None

df['matched'] = df.apply(check_matches, axis=1)
Which gives you:
priority string value matched
0 1 hi 2 NaN
1 10 astronaut 3 NaN
2 3 overflow -1 -1.0
3 4 varfoo 1 NaN
4 2 hello 0 0.0
Then you can sort the values and take the first non-NaN value from df.matched to get what you called final_val.
df.sort_values('priority').matched.dropna().iloc[0]
0.0
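Note that .iloc[0] raises an IndexError when nothing matched at all; a small sketch that falls back to None instead:
matched = df.sort_values('priority').matched.dropna()
final_val = matched.iloc[0] if not matched.empty else None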
Alternatively, you could sort and convert the df into a list of tuples:
l = df.sort_values('priority').apply(lambda r: (r['string'], r['value']), axis=1).tolist()
Giving:
l
[('hi', 2), ('hello', 0), ('overflow', -1), ('varfoo', 1), ('astronaut', 3)]
And write a function that stops when it hits the first match:
def check_matches(l):
    for (k, v) in l:
        if k in match_set:
            return v
check_matches(l)
0
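Since the matching here is exact word membership, the whole lookup can also be vectorized with isin, which should scale better to 20K rows and 1K queries; a sketch reusing match_set from above:
hits = df[df['string'].isin(match_set)].sort_values('priority')
final_val = hits['value'].iloc[0] if not hits.empty else None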
I'm currently facing a problem with method chaining when manipulating data frames in pandas. Here is the structure of my data:
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame(
    {'Frenquency': lst1,
     'lst2Tite': lst2,
     'lst3Tite': lst3
    })
The question is to get the entries (rows) where the frequency is less than 6, but it needs to be done with method chaining.
I know the traditional way is easy; I could just do
df[df["Frenquency"]<6]
to get the answer.
However, the question is about how to do it with method chaining. I tried something like
df.drop(lambda x:x.index if x["Frequency"] <6 else null)
but it raised an error "[<function <lambda> at 0x7faf529d3510>] not contained in axis"
Could anyone shed some light on this issue?
This is an old question, but I will answer it for future reference since there is no accepted answer.
df[df.apply(lambda x: True if x.Frenquency < 6 else False, axis=1)]
Explanation: the lambda checks the frequency and returns True if it is below 6, otherwise False, and that series of True and False is used by df to index the matching rows only. Note the column name Frenquency is a typo, but I kept it as it is since the question used it.
Or maybe this:
df.drop(df[df['Frenquency'] >= 6].index)
Or use inplace:
df.drop(df[df['Frenquency'] >= 6].index, inplace=True)
For this sort of selection, you can maintain a fluent interface and use method-chaining by using the query method:
>>> df.query('Frenquency < 6')
Frenquency lst2Tite lst3Tite
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
>>>
So something like:
df.rename(<something>).query('Frenquency <6').assign(<something>)
Or more concretely:
>>> (df.rename(columns={'Frenquency':'F'})
... .query('F < 6')
... .assign(FF=lambda x: x.F**2))
F lst2Tite lst3Tite FF
0 0 0 0 0
1 1 1 1 1
2 2 2 2 4
3 3 3 3 9
4 4 4 4 16
5 5 5 5 25
I feel this post did not have an answer that addressed the spirit of the question. The most chain-friendly way is (probably) to use pandas' .loc.
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame({"Frequency": lst1, "lst2Tite": lst2, "lst3Tite": lst3})
df.loc[lambda _df: _df["Frequency"] < 6]
Simple!
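Because the callable receives whatever frame precedes it in the chain, this composes with other methods too; a sketch mirroring the earlier rename/assign example:
(df.rename(columns={"Frequency": "F"})
   .loc[lambda d: d["F"] < 6]
   .assign(FF=lambda d: d.F ** 2))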
Would this satisfy your needs?
df.mask(df.Frenquency >= 6).dropna()
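One caveat: mask fills the masked rows with NaN, so numeric columns come back as float after dropna. A sketch restoring the integer dtype:
df.mask(df.Frenquency >= 6).dropna().astype(int)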