How to row-wise concatenate several columns containing strings? - python

I have a specific series of datasets which come in the following general form:
import pandas as pd
import random
df = pd.DataFrame({'n': random.sample(range(1000), 3),
                   't0': ['a', 'b', 'c'], 't1': ['d', 'e', 'f'],
                   't2': ['g', 'h', 'i'], 't3': ['i', 'j', 'k']})
The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30.
My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):
df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3
So far so good. This code may be simple but it becomes clumsy and inflexible as soon as I receive another dataset, where the number of tn columns goes up. This is where my question comes in:
Is there any other syntax to merge the content across multiple columns? Something agnostic to the number of columns, akin to:
df['result'] = ' '.join(df.iloc[:, 1:])
Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings:
Concatenate row-wise across specific columns of dataframe

The key to operating on columns (Series) of strings en masse is the Series.str accessor.
I can think of two .str methods to do what you want.
str.cat()
The first is str.cat. You have to start from a Series, but you can pass a list of Series (unfortunately you can't pass a DataFrame) to concatenate, with an optional separator. Using your example:
column_names = df.columns[1:]  # skipping the first, numeric, column
series_list = [df[c] for c in column_names]
# concatenate:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')
Or, in one line:
df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
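If some of the tn cells can be missing, str.cat can fill them via its na_rep argument, and in newer pandas versions (0.23 and later, if I recall correctly) it also accepts a DataFrame directly. A minimal sketch under those assumptions:
t_cols = df.columns[1:]  # all tn columns
# assumes pandas >= 0.23 (str.cat accepts a DataFrame); na_rep fills missing values
df['result'] = df[t_cols[0]].str.cat(df[t_cols[1:]], sep=' ', na_rep='')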
str.join()
The second is the .str.join() method, which works like the standard Python str.join(), but for which you need a column (Series) of iterables, for example a column of tuples, which we can get by applying tuple row-wise to a sub-dataframe of the columns you're interested in:
tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')
Or, in one line:
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
BTW, don't try the above with list instead of tuple. As of pandas 0.20.1, if the function passed to DataFrame.apply() returns a list and the returned list has the same number of entries as the columns of the original (sub)dataframe, DataFrame.apply() returns a DataFrame instead of a Series.

Other than using apply to concatenate the strings, you can also use agg to do so.
df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]:
0 a d g i
1 b e h j
2 c f i k
dtype: object
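If some cells might be non-strings (NaN, numbers), ' '.join will raise a TypeError; a small variation, not from the original answer, is to cast everything to str first (note that NaN then becomes the literal string 'nan'):
df['result'] = df[df.columns[1:]].astype(str).agg(' '.join, axis=1)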

Here is a slightly different solution:
In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()
In [58]: df
Out[58]:
     n t0 t1 t2 t3   result
0   92  a  d  g  i  a d g i
1  916  b  e  h  j  b e h j
2  363  c  f  i  k  c f i k

Related

Same comparison over two DataFrame columns to form a mask

I have a pandas Dataframe with columns col1 and col2. I am trying to build col3 as:
df["col3"] = (df["col1"] == 1) | (df["col2"] ==1)
and it works. I tried to rewrite it as:
df["col3"] = any([df[c] == 1 for c in ["col1", "col2"]])
but I get the infamous ValueError: The truth value of a series is ambiguous ...
I even tried to rewrite any( .. ) as pd.Series( .. ).any(), but it did not work.
How would you do it?
The simplest approach is to compare all the filtered columns against 1, which gives a boolean DataFrame, and then apply DataFrame.any:
(df[["col1", "col2"]] == 1).any(axis=1)
Your solution can be made to work with np.logical_or.reduce:
np.logical_or.reduce([df[c] == 1 for c in ["col1", "col2"]])
Or a bit overcomplicated:
pd.concat([df[c] == 1 for c in ["col1", "col2"]], axis=1).any(axis=1)
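In every variant, storing the result in the new column is just an assignment, e.g.:
df["col3"] = (df[["col1", "col2"]] == 1).any(axis=1)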
As was already explained in the comments, the built-in any function implicitly tries (and fails) to convert the Series to bool.
If you want something similar to your second code snippet, you can use NumPy's any function, since it accepts an axis argument:
import numpy as np
np.any([df[c] == 1 for c in ["col1", "col2"]], axis=0)
Alternatively, you could also extend your first code snippet to more columns by using reduce
In [6]: import functools
In [7]: functools.reduce(lambda a, b: a | b, [(df[c] == 1) for c in ['col1', 'col2']])
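If you prefer to avoid the lambda, the same reduction can be written with operator.or_ (a small variation, not part of the original answer):
import functools
import operator
df["col3"] = functools.reduce(operator.or_, [df[c] == 1 for c in ["col1", "col2"]])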

Concat multiple dataframes within a list

I have several dataframes in a list, obtained after using np.array_split, and I want to concat some of them into a single dataframe. In this example, I want to concat the 3 dataframes contained in b (all but the 2nd chunk of the original split, i.e. the element a[1]):
df = pd.DataFrame({'country': ['a', 'b', 'c', 'd'],
                   'gdp': [1, 2, 3, 4],
                   'iso': ['x', 'y', 'z', 'w']})
a = np.array_split(df, 4)
i = 1
b = a[:i] + a[i+1:]
desired_final_df = pd.DataFrame({'country': ['a', 'c', 'd'],
                                 'gdp': [1, 3, 4],
                                 'iso': ['x', 'z', 'w']})
I have tried to create an empty df and then use append through a loop over the elements in b, but without complete success:
CV = pd.DataFrame()
CV = [CV.append[(b[i])] for i in b]  # try1
CV = [CV.append(b[i]) for i in b]  # try2
CV = pd.DataFrame([CV.append[(b[i])] for i in b])  # try3
for i in b:
    CV.append(b)  # try4
I have reached a solution which works, but it is not efficient:
CV = pd.DataFrame()
CV = [CV.append(b) for i in b][0]
In this case, I get in CV the same dataframe three times, each containing all the rows, and I just keep the first of them. However, in my real case, in which I have big datasets, building the same thing three times would cost much more computation time.
How could I do that without repeating operations?
According to the docs, DataFrame.append does not work in place the way list.append does; the resulting DataFrame object is returned instead. Catching that object should be enough for what you need:
df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df)
You may want to use the keyword argument ignore_index=True in the append call so that the indices become continuous, instead of starting from 0 for each appended DataFrame (assuming that the index of the DataFrames in the list all start from 0).
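As a quick sketch of the same loop with that keyword applied (keep in mind that DataFrame.append is deprecated in newer pandas releases, where pd.concat is the recommended route):
df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df, ignore_index=True)  # ignore_index gives a continuous 0..n-1 index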
To concatenate multiple DataFrames, resetting the index, use pandas.concat:
pd.concat(b, ignore_index=True)
output
country gdp iso
0 a 1 x
1 c 3 z
2 d 4 w
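Putting it together with the variables from the question, a minimal sketch would be:
i = 1
b = a[:i] + a[i+1:]  # drop the i-th chunk
CV = pd.concat(b, ignore_index=True)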

How to split one column of lists into new columns when each list may have a different number of members?

So I have a dataframe in pandas with many columns.
One column holds a list of strings delimited as [u'str', ...] as shown below. There isn't an equal number of strings in each row.
column x
[u'str1', u'str2', u'str3']
[u'str4', u'str1']
[u'str5', u'str7', u'str8', u'str9']
I want to create new columns in the dataframe called column x-1, column x-2, up to column x-n.
How do I:
Figure out how many new columns I need (i.e. how many members the biggest list has)?
Create that many columns using the nomenclature mentioned?
Most importantly: split the strings into the new columns, leaving only what's between the single quotes (i.e. lose the u, the ', and the comma)?
If "column x" is the column of lists, you can pass the column as a Series to create a new DataFrame.
df['column x']
0 [a, b, c]
1 [d]
2 [e, f]
dtype: object
df2 = pd.DataFrame(df['column x'].tolist()).rename(lambda x: 'x-{}'.format(x + 1), axis=1)
df2
x-1 x-2 x-3
0 a b c
1 d None None
2 e f None
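If you also want to know how many new columns were created (i.e. the number of members in the biggest list), a quick check, assuming every cell of 'column x' holds a list, is:
n_cols = df['column x'].map(len).max()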
To add these columns back to df, use pd.concat:
df = pd.concat([df, df2], axis=1)
So the exact code for this question is:
df_test['actors_list'] = df_m.actors_list.str.split('u\'')  # splits on the delimiter u' (the \ is the escape character)
df_test2 = pd.DataFrame(df_test['actors_list'].tolist()).rename(lambda x: 'actors_list-{}'.format(x + 1), axis=1)
df_test2

Fastest method of finding and replacing row-specific data in a pandas DataFrame

Given an example pandas DataFrame:
Index | sometext | a   | ff
0     | 'asdff'  | 'b' | 'g'
1     | 'asdff'  | 'c' | 'hh'
2     | 'aaf'    | 'd' | 'i'
What would be the fastest way to replace all instances of the column names in the [sometext] field with the data in that column, where the values to replace are row-specific?
i.e. the desired result from the above input would be:
Index | sometext | a   | ff
0     | 'bsdg'   | 'b' | 'g'
1     | 'csdhh'  | 'c' | 'hh'
2     | 'ddf'    | 'd' | 'i'
note: there is no chance the replacement values would include column names.
I have tried iterating over the rows, but the execution time blows out as the length of the DataFrame and the number of replacement columns increase.
The Series.str.replace method applies a single pattern and replacement to the whole column, so it would need to be run separately for each row.
We can do this:
df.apply(lambda x : pd.Series(x['sometext']).replace({'a':x['a'],'ff':x['ff']},regex=True),1)
Out[773]:
0
0 bsdg
1 csdhh
2 ddf
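Because each per-row call returns a one-element Series, the result above is a one-column DataFrame whose single column is labelled 0; to store it back as a regular column you could select that column when assigning, e.g.:
df['new'] = df.apply(lambda x: pd.Series(x['sometext']).replace({'a': x['a'], 'ff': x['ff']}, regex=True), axis=1)[0]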
This way seems quite fast. See below for a brief discussion.
import re
df['new'] = df['sometext']
for v in ['a', 'ff']:
    df['new'] = df.apply(lambda x: re.sub(v, x[v], x['new']), axis=1)
Results:
sometext a ff new
0 asdff b g bsdg
1 asdff c hh csdhh
2 aaf d i ddf
Discussion:
I expanded the sample to 15,000 rows, and this was the fastest approach by around 10x or more compared to the existing answers (although I suspect there might be even faster ways).
The fact that you want to use the columns to make row-specific substitutions is what complicates this answer (otherwise you would just use a simpler version of @Wen's df.replace). As it is, both my approach and Wen's need extra code on top of that simple, fast method, though I think they work in more or less the same way.
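To make the loop above agnostic to the number of replacement columns, a sketch building on the same idea (not part of the original answer) is:
import re
df['new'] = df['sometext']
for v in [c for c in df.columns if c not in ('sometext', 'new')]:
    df['new'] = df.apply(lambda x, v=v: re.sub(v, x[v], x['new']), axis=1)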
I have the following:
import timeit
import pandas as pd

d = {'sometext': ['asdff', 'asdff', 'aaf'], 'a': ['b', 'c', 'd'], 'ff': ['g', 'hh', 'i']}
df = pd.DataFrame(data=d)

start = timeit.timeit()

def replace_single_string(row_label, original_column, final_column):
    # note: get_value is deprecated in newer pandas; .at[] is the modern equivalent
    result_1 = df.get_value(row_label, original_column)
    result_2 = df.get_value(row_label, final_column)
    if 'a' in result_1:
        df.at[row_label, original_column] = result_1.replace('a', result_2)
    else:
        pass
    return df

for i in df.index.values:
    df = replace_single_string(i, 'sometext', 'a')

print(df)

end = timeit.timeit()
print(end - start)
This ran in 0.000404119491577 seconds in Terminal.
The fastest method I found was to use the apply function in tandem with a replacer function that uses the basic str.replace() method. It's very fast, even with a for loop inside it, and it also allows for a dynamic number of columns:
def value_replacement(df_to_replace, replace_col):
    """ replace the <replace_col> column of a dataframe with the values in all other columns """
    cols = [col for col in df_to_replace.columns if col != replace_col]

    def replacer(rep_df):
        """ function to be used in the apply function """
        for col in cols:
            rep_df[replace_col] = \
                str(rep_df[replace_col]).replace(col.lower(), str(rep_df[col]))
        return rep_df[replace_col]

    df_to_replace[replace_col] = df_to_replace.apply(replacer, axis=1)
    return df_to_replace
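A quick usage sketch with the frame from the question:
d = {'sometext': ['asdff', 'asdff', 'aaf'], 'a': ['b', 'c', 'd'], 'ff': ['g', 'hh', 'i']}
df = pd.DataFrame(d)
df = value_replacement(df, 'sometext')  # 'sometext' becomes ['bsdg', 'csdhh', 'ddf']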

Filter string data based on its string length

I would like to filter out data whose string length is not equal to 10.
To filter out any row whose column A's or B's string length is not equal to 10, I tried this:
df=pd.read_csv('filex.csv')
df.A=df.A.apply(lambda x: x if len(x)== 10 else np.nan)
df.B=df.B.apply(lambda x: x if len(x)== 10 else np.nan)
df=df.dropna(subset=['A','B'], how='any')
This works, but it is slow.
However, it sometimes produces an error when the data in A is not a string but a number (interpreted as a number when read_csv reads the input file):
File "<stdin>", line 1, in <lambda>
TypeError: object of type 'float' has no len()
I believe there should be more efficient and elegant code than this.
Based on the answers and comments below, the simplest solutions I found are:
df = df[df.A.apply(lambda x: len(str(x)) == 10)]
df = df[df.B.apply(lambda x: len(str(x)) == 10)]
or
df = df[(df.A.apply(lambda x: len(str(x)) == 10)) & (df.B.apply(lambda x: len(str(x)) == 10))]
or
df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)]
import pandas as pd
df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)
Applied to filex.csv:
A,B
123,abc
1234,abcd
1234567890,abcdefghij
the code above prints
A B
2 1234567890 abcdefghij
A more Pythonic way of filtering out rows based on given conditions of other columns and their values:
Assuming a df of:
data = {
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
}
df = pd.DataFrame(data)
df:
age cars names
0 1 Civic Alice
1 4 BMW Zac
2 2 Mitsubishi Anna
3 0 Benz O
Then:
df[
    df["names"].apply(lambda x: len(x) > 1)
    & df["cars"].apply(lambda x: "i" in x)
    & df["age"].apply(lambda x: int(x) < 2)
]
We will have:
age cars names
0 1 Civic Alice
In the conditions above we first look at the length of the strings, then check whether the letter "i" exists in the strings, and finally check the integer value in the age column.
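The same filter can also be written with vectorized string and numeric operations instead of apply (an alternative sketch, not from the original answer):
df[(df["names"].str.len() > 1) & df["cars"].str.contains("i") & (df["age"].astype(int) < 2)]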
I personally found this way to be the easiest:
df = df[df['column_name'].str.len() == 10]
You can also use query:
df.query('A.str.len() == 10 & B.str.len() == 10')
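Depending on your pandas version and whether numexpr is installed, the default query engine may not support the .str accessor; if you hit that, forcing the Python engine should help (an assumption worth checking against your setup):
df.query('A.str.len() == 10 and B.str.len() == 10', engine='python')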
If you have numbers in the rows, they will be read in as floats.
Convert all the rows to strings after importing from CSV. For better performance, split those lambdas across multiple threads.
You can use df.apply(len); it will give you the result.
For string operations such as this, vanilla Python using built-in methods (without lambda) is much faster than apply() or str.len().
Building a boolean mask by mapping len to each string inside a list comprehension is approx. 40-70% faster than apply() and str.len() respectively.
For multiple columns, zip() allows you to evaluate values from different columns concurrently.
col_A_len = map(len, df['A'].astype(str))
col_B_len = map(len, df['B'].astype(str))
m = [a==3 and b==3 for a,b in zip(col_A_len, col_B_len)]
df1 = df[m]
For a single column, drop zip(), loop over the column, and check if the length is equal to 3:
df2 = df[[a==3 for a in map(len, df['A'].astype(str))]]
This code can be written a little more concisely using the Series.map() method (but it is a little slower than the list comprehension due to pandas overhead):
df2 = df[df['A'].astype(str).map(len)==3]
To filter out values with a length other than 10 from columns A and B, pass a lambda expression to map(); Series.map() always operates element-wise on a Series object.
df = df[df['A'].map(lambda x: len(str(x)) == 10)]
df = df[df['B'].map(lambda x: len(str(x)) == 10)]
You could use applymap to filter all the columns you want at once, followed by the .all() method to keep only the rows where both columns are True.
#The *mask* variable is a dataframe of booleans, giving you True or False for the selected condition
mask = df[['A','B']].applymap(lambda x: len(str(x)) == 10)
#Here you can just use the mask to filter your rows, using the method *.all()* to filter only rows that are all True, but you could also use the *.any()* method for other needs
df = df[mask.all(axis=1)]
