I have multiple columns with the same name prefix mixed in with other columns.
Some of the columns I want to combine have nulls in their rows.
For instance,
apple_0 apple_1
0 abc None
1 abc efg
2 hig None
3 dsf None
and I want:
apple
0 abc
1 abc, efg
2 hig
3 dsf
I have about 85 of these columns.
The actual names are scheduleSettings_nodes_0_name, scheduleSettings_nodes_1_name, and so on.
How can I combine these?
In addition to the other answers, you could also try using agg, something like this:
import pandas as pd
df = pd.DataFrame({'apple_0': ['abc', 'abc', 'hig', 'dsf'], 'apple_1': [None, 'efg', None, None]})
selected_cols = [col for col in df.columns if col.startswith('apple')]
df['apple'] = df[selected_cols].agg(lambda x: ', '.join(map(str, filter(None, x))), axis=1)
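For the real column names mentioned in the question (scheduleSettings_nodes_0_name and so on), the same idea generalizes by collecting the matching columns first. A rough sketch, assuming they all share that prefix and suffix; the name of the combined column is my own choice:
node_cols = [c for c in df.columns
             if c.startswith('scheduleSettings_nodes_') and c.endswith('_name')]
df['scheduleSettings_name'] = df[node_cols].agg(
    lambda r: ', '.join(str(v) for v in r if pd.notna(v)), axis=1)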
Option 1: using stacking to drop the nulls, then aggregating per group.
(df.filter(like='apple_')
.replace('None', pd.NA)  # in case the missing values are the literal string 'None'
.stack()
.groupby(level=0).agg(','.join)
.reindex(df.index)
.to_frame('apple')
)
Option 2: using an internal loop with agg.
(df.filter(like='apple_')
.replace('None', pd.NA)
.agg(lambda r: ','.join(x for x in r if pd.notna(x)), axis=1)
.to_frame('apple')
)
Output:
apple
0 abc
1 abc,efg
2 hig
3 dsf
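For orientation, a self-contained sketch of Option 2 on the sample frame, with the extra step of dropping the original columns afterwards (that last step is my addition, not part of the answer):
import pandas as pd
df = pd.DataFrame({'apple_0': ['abc', 'abc', 'hig', 'dsf'],
                   'apple_1': [None, 'efg', None, None]})
df['apple'] = (df.filter(like='apple_')
                 .agg(lambda r: ','.join(x for x in r if pd.notna(x)), axis=1))
df = df.drop(columns=df.filter(like='apple_').columns)  # 'apple' itself is not matched
print(df)
#      apple
# 0      abc
# 1  abc,efg
# 2      hig
# 3      dsf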
You can use df.apply with axis=1 to apply a function to each row. This function can use str.join to join the elements of all columns that match apple_*, skipping elements that are NA:
def join_cols(row, sep, col_names):
return sep.join(row[c]
for c in col_names
if not pd.isna(row[c]))
cols_to_combine = ["apple_0", "apple_1"]
df["apple"] = df.apply(join_cols, axis=1, args=(",", cols_to_combine))
df.drop(columns=cols_to_combine, inplace=True)
Which gives your desired output:
apple
0 abc
1 abc,efg
2 hig
3 dsf
To figure out which columns match your pattern, you could do:
cols_to_combine = [c for c in df.columns if c.startswith("apple_")]
There's probably a way to vectorize this, which would be preferred over apply() in terms of speed.
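One apply-free possibility is sketched below. This is my own variation, not part of the answer, and it assumes the values themselves never contain commas: fill the NAs with empty strings, concatenate everything with str.cat, then clean up the leftover separators.
combined = (df[cols_to_combine[0]].fillna('')
              .str.cat(df[cols_to_combine[1:]].fillna(''), sep=','))
# rows that were entirely NA end up as empty strings here
df['apple'] = (combined.str.replace(r',{2,}', ',', regex=True)
                       .str.strip(','))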
Related
I want to merge several strings in a dataframe based on a groupedby in Pandas.
This is my code so far:
import pandas as pd
from io import StringIO
data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
""")
# load string as stream into dataframe
df = pd.read_csv(data,header=0, names=["name","text","date"],parse_dates=[2])
# add column with month
df["month"] = df["date"].apply(lambda x: x.month)
I want the end result to look like this:
I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!
You can groupby the 'name' and 'month' columns, then call transform which will return data aligned to the original df and apply a lambda where we join the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
name text month
0 name1 hej,du 11
2 name1 aj,oj 12
4 name2 fin,katt 11
6 name2 mycket,lite 12
I subset the original df by passing a list of the columns of interest, df[['name','text','month']], and then call drop_duplicates.
EDIT: actually I can just call apply and then reset_index:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
Update: the lambda is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
We can group by the 'name' and 'month' columns, then call the agg() function of pandas DataFrame objects.
The aggregation functionality provided by agg() allows multiple statistics to be calculated per group in one pass.
df.groupby(['name', 'month'], as_index = False).agg({'text': ' '.join})
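A self-contained sketch with the question's values typed in directly (the frame construction is mine), so the expected shape of the result is easy to see:
import pandas as pd
df = pd.DataFrame({'name': ['name1', 'name1', 'name1', 'name1',
                            'name2', 'name2', 'name2', 'name2'],
                   'text': ['hej', 'du', 'aj', 'oj',
                            'fin', 'katt', 'mycket', 'lite'],
                   'month': [11, 11, 12, 12, 11, 11, 12, 12]})
print(df.groupby(['name', 'month'], as_index=False).agg({'text': ' '.join}))
#     name  month         text
# 0  name1     11       hej du
# 1  name1     12        aj oj
# 2  name2     11     fin katt
# 3  name2     12  mycket lite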
The answer by EdChum provides you with a lot of flexibility, but if you just want to concatenate your strings into a column of list objects you can also:
output_series = df.groupby(['name','month'])['text'].apply(list)
If you want to concatenate your "text" in a list:
df.groupby(['name', 'month'], as_index = False).agg({'text': list})
For me the above solutions were close but added some unwanted \n's and dtype: object, so here's a modified version:
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('(\\n)', '').reset_index()
Please try this line of code:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Although this is an old question, just in case: I used the code below and it seems to work like a charm.
text = ''.join(df[df['date'].dt.month==8]['text'])
Thanks to all the other answers, the following is probably the most concise and feels more natural. Using df.groupby("X")["A"].agg() aggregates over one or more selected columns.
import pandas
df = pandas.DataFrame({'A' : ['a', 'a', 'b', 'c', 'c'],
                       'B' : ['i', 'j', 'k', 'i', 'j'],
                       'X' : [1, 2, 2, 1, 3]})
A B X
a i 1
a j 2
b k 2
c i 1
c j 3
df.groupby("X", as_index=False)["A"].agg(' '.join)
X A
1 a c
2 a b
3 c
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
X A B
1 a c i i
2 a b j k
3 c j
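If different columns should be joined differently, agg also accepts a dict of functions; a small sketch on the same frame (the choice of separators is mine):
df.groupby("X", as_index=False).agg({'A': ' '.join, 'B': ','.join})
#    X    A    B
# 0  1  a c  i,i
# 1  2  a b  j,k
# 2  3    c    j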
I have a df with two columns a and b.
import pandas as pd
raw_data = {'a': ['2019145236792', 'abc_def date_1220', '2020124832852', 'jhi_abc this_1219_abc'],
'b': ['tom','john','mark','jim']}
df = pd.DataFrame(raw_data, columns=['a', 'b'])
df
a b
0 2019145236792 tom
1 abc_def date_1220 john
2 2020124832852 mark
3 jhi_abc this_1219_abc jim
I want to separate out the data that contains 20 at that position. The position of 20 won't change.
E.g. 2020124832852 and abc_def date_1220
Expected output:
a b
0 abc_def date_1220 john
1 2020124832852 mark
Use boolean indexing with a mask that compares a string slice (via str indexing and Series.eq), chained by | (bitwise OR) with a second mask that uses Series.str.extract to get the values after date_:
m1 = df['a'].str[2:4].eq('20')
m2 = df['a'].str.extract('date_(.*)', expand=False).str[2:4].eq('20')
df = df[m1 | m2]
print (df)
a b
1 abc_def date_1220 john
2 2020124832852 mark
EDIT: alternatively, the second mask can be built with Series.str.split:
m2 = df['a'].str.split('_', n=2).str[2].str[2:4].eq('20')
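A quick check of the edited mask against the question's frame (a sketch; it assumes the interesting token is always the third _-separated piece, as in the examples):
m1 = df['a'].str[2:4].eq('20')
m2 = df['a'].str.split('_', n=2).str[2].str[2:4].eq('20')
print(df[m1 | m2])
#                    a     b
# 1  abc_def date_1220  john
# 2      2020124832852  mark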
You can use a list comprehension to get the wanted rows, but you have to specify the required positions:
import re
req_pos = {2, 15}  # start positions of '20' in '2020124832852' and 'abc_def date_1220'
df[[any(e.start() in req_pos for e in re.finditer('20', s)) for s in df.a]]
I have a dataset that on one of its columns, each element is a list.
I would like to flatten it, so that every list element gets a row of its own.
I managed to solve it with iterrows, dict and append (see below), but it is too slow with my real DF, which is large.
Is there a way to make things faster?
I can consider replacing the column that holds a list per element with another format (maybe a hierarchical df?) if that would make more sense.
EDIT: I have many columns, and some might change in the future. The only thing I know for sure is that I have the fields column. That's why I used dict in my solution.
A minimal example, creating a df to play with:
import pandas as pd
import StringIO
df = pd.read_csv(StringIO.StringIO("""
id|name|fields
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]
"""), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
print df
resulting df:
id name fields
0 1 abc [qq, ww, rr]
1 2 efg [zz, xx, rr]
my (slow) solution:
new_df = pd.DataFrame(index=[], columns=df.columns)
for _, i in df.iterrows():
flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
new_df = new_df.append(flattened_d )
Resulting with
id name fields
0 1.0 abc qq
1 1.0 abc ww
2 1.0 abc rr
0 2.0 efg zz
1 2.0 efg xx
2 2.0 efg rr
You can use numpy for better performance:
Both solutions use mainly numpy.repeat.
from itertools import chain
import numpy as np
vals = df.fields.str.len()
df1 = pd.DataFrame({
    "id": np.repeat(df.id.values, vals),
    "name": np.repeat(df.name.values, vals),
    "fields": list(chain.from_iterable(df.fields))})
# reindex_axis was removed in newer pandas; plain reindex does the same here
df1 = df1.reindex(df.columns, axis=1)
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
Another solution:
df[['id','name']].values converts the columns to a numpy array, numpy.repeat duplicates the rows according to the list lengths, numpy.hstack flattens the values in the lists, and numpy.column_stack puts them back together.
df1 = pd.DataFrame(np.column_stack((df[['id','name']].values
                                      .repeat(list(map(len, df.fields)), axis=0),
                                    np.hstack(df.fields))),
                   columns=df.columns)
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
A more general solution is to filter out the fields column and then add it back in the DataFrame constructor, because it is always the last column:
cols = df.columns[df.columns != 'fields'].tolist()
print (cols)
['id', 'name']
df1 = pd.DataFrame(np.column_stack((df[cols].values
                                      .repeat(list(map(len, df.fields)), axis=0),
                                    np.hstack(df.fields))),
                   columns=cols + ['fields'])
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
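For what it's worth (not part of the original answers), newer pandas versions (0.25 and later) also provide DataFrame.explode, which handles exactly this case and keeps the other columns automatically:
df1 = df.explode('fields').reset_index(drop=True)
print(df1)
#   id name fields
# 0  1  abc     qq
# 1  1  abc     ww
# 2  1  abc     rr
# 3  2  efg     zz
# 4  2  efg     xx
# 5  2  efg     rr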
If your CSV is many thousands of lines long, then using_string_methods (below)
may be faster than using_iterrows or using_repeat:
With
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
In [210]: %timeit using_string_methods(csv)
10 loops, best of 3: 100 ms per loop
In [211]: %timeit using_itertuples(csv)
10 loops, best of 3: 119 ms per loop
In [212]: %timeit using_repeat(csv)
10 loops, best of 3: 126 ms per loop
In [213]: %timeit using_iterrows(csv)
1 loop, best of 3: 1min 7s per loop
So for a 10000-line CSV, using_string_methods is over 600x faster than using_iterrows, and marginally faster than using_repeat.
import numpy as np
import pandas as pd
try: from cStringIO import StringIO  # for Python2
except ImportError: from io import StringIO  # for Python3
def using_string_methods(csv):
df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
other_columns = df.columns.difference(['fields']).tolist()
fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
.str.split(r',', expand=True))
df = pd.concat([df.drop('fields', axis=1), fields], axis=1)
result = (pd.melt(df, id_vars=other_columns, value_name='field')
.drop('variable', axis=1))
result = result.dropna(subset=['field'])
return result
def using_iterrows(csv):
df = pd.read_csv(StringIO(csv), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
new_df = pd.DataFrame(index=[], columns=df.columns)
for _, i in df.iterrows():
flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
new_df = new_df.append(flattened_d )
return new_df
def using_repeat(csv):
df = pd.read_csv(StringIO(csv), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
cols = df.columns[df.columns != 'fields'].tolist()
df1 = pd.DataFrame(np.column_stack(
(df[cols].values.repeat(list(map(len,df.fields)),axis=0),
np.hstack(df.fields))), columns=cols + ['fields'])
return df1
def using_itertuples(csv):
df = pd.read_csv(StringIO(csv), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
other_columns = df.columns.difference(['fields']).tolist()
data = []
for tup in df.itertuples():
data.extend([[getattr(tup, col) for col in other_columns]+[field]
for field in tup.fields])
return pd.DataFrame(data, columns=other_columns+['field'])
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
Generally, fast NumPy/Pandas operations are possible only when the data is in a
native NumPy dtype (such as int64 or float64, or strings). Once you place
lists (a non-native NumPy dtype) in a DataFrame, the jig is up -- you are forced
to use Python-speed loops to process the lists.
So to improve performance, you need to avoid placing lists in a DataFrame.
using_string_methods loads the fields data as strings:
df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
and avoids using the apply method (which is generally as slow as a plain Python loop):
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
Instead, it uses faster vectorized string methods to break the strings up into
separate columns:
fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
.str.split(r',', expand=True))
Once you have the fields in separate columns, you can use pd.melt to reshape
the DataFrame into the desired format.
pd.melt(df, id_vars=['id', 'name'], value_name='field')
By the way, you might be interested to see that with a slight modification using_iterrows can be just as fast as using_repeat. I show the changes in using_itertuples.
df.itertuples tends to be slightly faster than df.iterrows, but the difference is minor. The majority of the speed gain is achieved by avoiding calling df.append in a for-loop since that leads to quadratic copying.
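The general pattern in a minimal sketch, assuming the id/name/fields columns from the question: collect plain Python rows first and build the DataFrame once at the end. (DataFrame.append itself was removed in pandas 2.0, so this collect-then-construct style is also the variant that still runs today.)
rows = []
for tup in df.itertuples():
    # one output row per element of the fields list
    rows.extend([tup.id, tup.name, field] for field in tup.fields)
result = pd.DataFrame(rows, columns=['id', 'name', 'fields'])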
You can break the lists in the fields column into multiple columns by applying pandas.Series to fields and then joining them back to id and name like so:
cols = df.columns[df.columns != 'fields'].tolist() # adapted from #jezrael
df = df[cols].join(df.fields.apply(pandas.Series))
Then you can melt the resulting new columns using set_index and stack, and then reset the index:
df = df.set_index(cols).stack().reset_index()
Finally, drop the redundant column generated by reset_index and rename the generated column to "field":
df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})
How can I filter rows where one column's value is contained in another column?
For example, if we have a DataFrame DT with two columns A and B, can we filter rows with B.contains(A)? Not whether B contains any of the A values from the whole DT, but only the A value from the same row.
A B
'lol' 'lolec'
'ram' 'rambo'
'ki' 'pio'
Result:
A B
'lol' 'lolec'
'ram' 'rambo'
You can use boolean indexing with a mask created by apply and in, if you need to compare columns A and B row by row:
#if necessary strip ' in all values
df = df.apply(lambda x: x.str.strip("'"))
#df = df.applymap(lambda x: x.strip("'"))
print (df.apply(lambda x: x.A in x.B, axis=1))
0 True
1 True
2 False
dtype: bool
df = df[df.apply(lambda x: x.A in x.B, axis=1)]
print (df)
A B
0 lol lolec
1 ram rambo
To see the difference between the solutions, the input DataFrame is changed:
print (df)
A B
0 lol pio
1 ram rambo
2 ki lolec
print (df[df.apply(lambda x: x.A in x.B, axis=1)])
A B
1 ram rambo
print (df[df['B'].str.contains("|".join(df['A']))])
A B
1 ram rambo
2 ki lolec
For improved performance, use a list comprehension:
df = df[[a in b for a, b in zip(df.A, df.B)]]
You can use str.contains to match any of the substrings by joining the contents of the other series with the regex | character, which acts as an OR:
df[df['B'].str.contains("|".join(df['A']))]
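One caveat worth adding (my addition, not part of the answer): if the values in A can contain regex metacharacters, escape them before joining:
import re
pattern = "|".join(map(re.escape, df['A']))
df[df['B'].str.contains(pattern)]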
I have a pandas dataframe with the following column names:
Result1, Test1, Result2, Test2, Result3, Test3, etc...
I want to drop all the columns whose name contains the word "Test". The number of such columns is not static but depends on a previous function.
How can I do that?
Here is one way to do this:
df = df[df.columns.drop(list(df.filter(regex='Test')))]
import pandas as pd
import numpy as np
array=np.random.random((2,4))
df=pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print(df)
Test1 toto test2 riri
0 0.923249 0.572528 0.845464 0.144891
1 0.020438 0.332540 0.144455 0.741412
cols = [c for c in df.columns if c.lower()[:4] != 'test']
df=df[cols]
print(df)
toto riri
0 0.572528 0.144891
1 0.332540 0.741412
Cheaper, Faster, and Idiomatic: str.contains
In recent versions of pandas, you can use string methods on the index and columns. Here, str.startswith seems like a good fit.
To remove all columns starting with a given substring:
df.columns.str.startswith('Test')
# array([ True, False, False, False])
df.loc[:,~df.columns.str.startswith('Test')]
toto test2 riri
0 x x x
1 x x x
For case-insensitive matching, you can use regex-based matching with str.contains with a start-of-line (^) anchor:
df.columns.str.contains('^test', case=False)
# array([ True, False, True, False])
df.loc[:,~df.columns.str.contains('^test', case=False)]
toto riri
0 x x
1 x x
If mixed types are a possibility, specify na=False as well.
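A tiny self-contained sketch of both variants, reusing the column names from the earlier example (the values are placeholders):
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4]], columns=['Test1', 'toto', 'test2', 'riri'])
df.loc[:, ~df.columns.str.startswith('Test')]             # keeps toto, test2, riri
df.loc[:, ~df.columns.str.contains('^test', case=False)]  # keeps toto, riri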
This can be done neatly in one line with:
df = df.drop(df.filter(regex='Test').columns, axis=1)
You can select only the columns you DO want using filter:
import pandas as pd
import numpy as np
data2 = [{'test2': 1, 'result1': 2}, {'test': 5, 'result34': 10, 'c': 20}]
df = pd.DataFrame(data2)
df
c result1 result34 test test2
0 NaN 2.0 NaN NaN 1.0
1 20.0 NaN 10.0 5.0 NaN
Now filter:
df.filter(like='result',axis=1)
to get:
result1 result34
0 2.0 NaN
1 NaN 10.0
Using a regex to match all columns not containing the unwanted word:
df = df.filter(regex='^((?!badword).)*$')
Use the DataFrame.select method:
In [38]: df = DataFrame({'Test1': randn(10), 'Test2': randn(10), 'awesome': randn(10)})
In [39]: df.select(lambda x: not re.search('Test\d+', x), axis=1)
Out[39]:
awesome
0 1.215
1 1.247
2 0.142
3 0.169
4 0.137
5 -0.971
6 0.736
7 0.214
8 0.111
9 -0.214
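Note that DataFrame.select was deprecated and later removed from pandas, so on current versions a rough equivalent of the call above (my substitution, not part of the original answer) is a boolean column mask with loc:
import re
df.loc[:, [not re.search(r'Test\d+', c) for c in df.columns]]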
This method does everything in place. Many of the other answers create copies and are not as efficient:
df.drop(df.columns[df.columns.str.contains('Test')], axis=1, inplace=True)
Question states 'I want to drop all the columns whose name contains the word "Test".'
test_columns = [col for col in df if 'Test' in col]
df.drop(columns=test_columns, inplace=True)
You can use df.filter to get the list of columns that match your string and then use df.drop:
resdf = df.drop(df.filter(like='Test',axis=1).columns.to_list(), axis=1)
A solution for dropping a list of column names given as regex patterns. I prefer this approach because I'm frequently editing the drop list. It uses a negative filter regex built from the drop list.
import re
drop_column_names = ['A','B.+','C.*']
drop_columns_regex = '^(?!(?:'+'|'.join(drop_column_names)+')$)'
print('Dropping columns:',', '.join([c for c in df.columns if re.search(drop_columns_regex,c)]))
df = df.filter(regex=drop_columns_regex,axis=1)
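A quick illustration with a made-up frame (the column names here are hypothetical):
import pandas as pd
df = pd.DataFrame(columns=['A', 'B1', 'B2', 'C_x', 'D'])
drop_column_names = ['A', 'B.+', 'C.*']
drop_columns_regex = '^(?!(?:' + '|'.join(drop_column_names) + ')$)'
df = df.filter(regex=drop_columns_regex, axis=1)
print(list(df.columns))   # ['D']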