Create new dataframe column from the values of 2 other columns - python

I have two columns in my data frame. In any given row, at least one of the columns contains a string; the other column may contain None or another string.
I want to create a third column that takes the value of whichever column holds the string when the other is None, and the concatenation of the two when both are strings.
How can I do this?
  column1  column2         column3
0   hello     None           hello
1    None  goodbye         goodbye
2   hello  goodbye  hello, goodbye

Series.str.cat
Use na_rep='' so that a missing value does not turn the entire row's join into NaN. Then strip the excess separators that were joined in because of the missing data (this assumes the separator characters don't also start or end any of your words).
import pandas as pd

df = pd.DataFrame({'column1': ['hello', None, 'hello'],
                   'column2': [None, 'goodbye', 'goodbye']})

sep = ', '
df['column3'] = (df['column1'].str.cat(df['column2'], sep=sep, na_rep='')
                              .str.strip(sep))
print(df)
  column1  column2         column3
0   hello     None           hello
1    None  goodbye         goodbye
2   hello  goodbye  hello, goodbye
With many columns, where there might be runs of missing data in the middle of a row, the above doesn't remove the excess separators. Instead you could use a (slow) lambda along the rows, joining each row's values after dropping the nulls:
df['column3'] = df.apply(lambda row: ', '.join(row.dropna()), axis=1)
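For instance, on a hypothetical three-column frame with a gap in the middle (where chained str.cat would leave the excess 'hello, , goodbye', since str.strip only trims the ends):
import pandas as pd

df = pd.DataFrame({'a': ['hello', None],
                   'b': [None, 'nice'],
                   'c': ['goodbye', 'goodbye']})

# dropna removes the Nones row by row before joining
df['joined'] = df.apply(lambda row: ', '.join(row.dropna()), axis=1)
print(df['joined'])
# 0    hello, goodbye
# 1     nice, goodbye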

Solution
You could replace all the NaNs with an empty string and then concatenate the columns (A and B) to create column C.
df2 = df.fillna('')
df['C'] = df2.A.str.strip() + df2.B.str.strip()
print(df)
Output:
       A     B        C
0      1     3       13
1      2  None        2
2    dog   dog   dogdog
3   None  None
4  snake    20  snake20
5    cat  None      cat
Dummy Data
d = {
    'A': ['1', '2', 'dog', None, 'snake', 'cat'],
    'B': ['3', None, 'dog', None, '20', None],
}
df = pd.DataFrame(d)
print(df)
Output:
       A     B
0      1     3
1      2  None
2    dog   dog
3   None  None
4  snake    20
5    cat  None
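If you also want the ', ' separator from the original question rather than plain concatenation, a minimal variant of the same fillna idea works (with the same caveat as the first answer: the words themselves must not start or end with separator characters):
df2 = df.fillna('')
df['C'] = (df2.A + ', ' + df2.B).str.strip(', ')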

Related

excluding specific rows in pandas dataframe

I am trying to create a new dataframe selecting only the rows where a specific column's value does not start with a capital S.
I have tried the following options:
New_dataframe = dataframe.loc[~dataframe.column.str.startswith(('S'))]
filter = dataframe['column'].astype(str).str.contains(r'^\S')
New_dataframe = dataframe[~filter]
However, both options return an empty dataframe. Does anybody have a better solution?
Your first option works well:
dataframe = pd.DataFrame({'ColA': ['Start', 'Hello', 'World', 'Stop'],
                          'ColB': [3, 4, 5, 6]})
New_dataframe = dataframe.loc[~dataframe['ColA'].str.startswith('S')]
print(New_dataframe)
Output:
>>> New_dataframe
    ColA  ColB
1  Hello     4
2  World     5
>>> dataframe
    ColA  ColB
0  Start     3
1  Hello     4
2  World     5
3   Stop     6
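As for the second attempt: in a regular expression, \S means "any non-whitespace character", not a literal capital S, so r'^\S' matches virtually every row and ~filter keeps none of them. Matching the literal letter fixes it, e.g. on the example frame above:
filter = dataframe['ColA'].astype(str).str.contains(r'^S')
New_dataframe = dataframe[~filter]
print(New_dataframe)
#     ColA  ColB
# 1  Hello     4
# 2  World     5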

Rename duplicate column name by order in Pandas

I have a dataframe, df, in which I would like to rename two duplicate columns in consecutive order:
Data
   DD  Nice  Nice  Hello
0   0     1     1      2
Desired
   DD  Nice1  Nice2  Hello
0   0      1      1      2
Doing
df.rename(columns={"Nice": "Nice1", "Nice": "Nice2"})
I am running the rename function; however, because both dictionary keys are identical, the results are not desirable.
Here's an approach with groupby:
import numpy as np

s = df.columns.to_series().groupby(df.columns)
df.columns = np.where(s.transform('size') > 1,
                      df.columns + s.cumcount().add(1).astype(str),
                      df.columns)
Output:
   DD  Nice1  Nice2  Hello
0   0      1      1      2
You could use an itertools.count() counter and a list comprehension to create new column headers, then assign them to the data frame.
For example:
>>> import itertools
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3]], columns=["Nice", "Nice", "Hello"])
>>> df
   Nice  Nice  Hello
0     1     2      3
>>> count = itertools.count(1)
>>> new_cols = [f"Nice{next(count)}" if col == "Nice" else col for col in df.columns]
>>> df.columns = new_cols
>>> df
   Nice1  Nice2  Hello
0      1      2      3
(Python 3.6+ required for the f-strings)
EDIT: Alternatively, per the comment below, the list comprehension can renumber any label that contains "Nice", in case there are unexpected spaces or other characters:
new_cols = [f"Nice{next(count)}" if "Nice" in col else col for col in df.columns]
You can use:
cols = pd.Series(df.columns)
dup_count = cols.value_counts()

for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) for i in range(1, dup_count[dup] + 1)]

df.columns = cols
Input:
col_1  Nice  Nice  Nice  Hello  Hello  Hello
col_2     1     2     3      4      5      6
Output:
col_1  Nice1  Nice2  Nice3  Hello1  Hello2  Hello3
col_2      1      2      3       4       5       6
Setup to generate duplicate cols:
df = pd.DataFrame(data={'col_1': ['Nice', 'Nice', 'Nice', 'Hello', 'Hello', 'Hello'],
                        'col_2': [1, 2, 3, 4, 5, 6]})
df = df.set_index('col_1').T

Pandas: filling placeholders in string column

I am working with a pandas DataFrame that looks as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['There are # people', '3', np.nan],
     ['# out of # people are there', 'Five', 'eight'],
     ['Only # are here', '2', np.nan],
     ['The rest is at home', np.nan, np.nan]])
resulting in:
                             0     1      2
0           There are # people     3    NaN
1  # out of # people are there  Five  eight
2              Only # are here     2    NaN
3          The rest is at home   NaN    NaN
I would like to replace the # placeholders with the varying strings in columns 1 and 2, resulting in:
0                    There are 3 people
1    Five out of eight people are there
2                       Only 2 are here
3                   The rest is at home
How could I achieve this?
Using string formatting:
df = df.replace({'#': '%s', np.nan: 'NaN'}, regex=True)

l = []
for x, y in df.iterrows():
    if y[2] == 'NaN' and y[1] == 'NaN':
        l.append(y[0])
    elif y[2] == 'NaN':
        l.append(y[0] % (y[1],))
    else:
        l.append(y[0] % (y[1], y[2]))
l
Out[339]:
['There are 3 people',
 'Five out of eight people are there',
 'Only 2 are here',
 'The rest is at home']
A more concise way to do it. Note that comparing with np.nan via != is always True, so the missing-value check should use pd.notna:
cols = df.columns
df[cols[0]] = df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[1]]), 1) if pd.notna(x[cols[1]]) else x[cols[0]], axis=1)
print(df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[2]]), 1) if pd.notna(x[cols[2]]) else x[cols[0]], axis=1))
Out[12]:
0                    There are 3 people
1    Five out of eight people are there
2                       Only 2 are here
3                   The rest is at home
Name: 0, dtype: object
If you need to do this for even more columns:
cols = df.columns
for i in range(1, len(cols)):
    df[cols[0]] = df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[i]]), 1) if pd.notna(x[cols[i]]) else x[cols[0]], axis=1)
print(df[cols[0]])
A generic replace function, in case you have more values to add:
It replaces successive instances of a given placeholder character in a string using a list of values (just two in your case, but it can handle more).
def replace_hastag(text, values, replace_char='#'):
    for v in values:
        if pd.isna(v):
            return text
        text = text.replace(replace_char, str(v), 1)
    return text

df['text'] = df.apply(lambda r: replace_hastag(r[0], values=[r[1], r[2]]), axis=1)
Result
In [79]: df.text
Out[79]:
0                    There are 3 people
1    Five out of eight people are there
2                       Only 2 are here
3                   The rest is at home
Name: text, dtype: object

Pandas dataframe isolate row below a keyword

I have a one-column dataframe:
df = pd.read_csv(txt_file, header=None)
I am trying to search for a string in the column and then return the row after it:
key_word_df = df[df[0].str.contains("KeyWord")]
I don't know how, each time the keyword is found, to isolate the row below it and assign it to a new df.
You could use the .shift method on an indexer. I've split this into multiple lines to demonstrate what's happening, but you could do the operation in a one-liner for brevity in practice.
import pandas as pd

# 1. Dummy DataFrame with strings
In [1]: df = pd.DataFrame(["one", "two", "one", "two", "three"], columns=["text"])

# 2. Create the indexer: use `shift` to move the values down one and `fillna` to replace the resulting NaN
In [2]: idx = df["text"].str.contains("one").shift(1).fillna(False)

In [3]: idx
Out[3]:
0    False
1     True
2    False
3     True
4    False
Name: text, dtype: bool

# 3. Use the indexer to show the row after each matched value:
In [4]: df[idx]
Out[4]:
  text
1  two
3  two
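As mentioned above, the whole thing also works as a one-liner; `shift`'s `fill_value` parameter (available in modern pandas) replaces the separate `fillna` step and keeps the boolean dtype:
df[df["text"].str.contains("one").shift(1, fill_value=False)]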
You can use the shift function. Here's an example
import pandas as pd

df = pd.DataFrame({'word': ['hello', 'ice', 'kitten', 'hello', 'foo', 'bar', 'hello'],
                   'val': [1, 2, 3, 4, 5, 6, 7]})
   val    word
0    1   hello
1    2     ice
2    3  kitten
3    4   hello
4    5     foo
5    6     bar
6    7   hello
keyword = 'hello'
df[(df['word'] == keyword).shift(1).fillna(False)]

   val word
1    2  ice
4    5  foo
Here is one way.
Get the index of the rows that match your condition. Then use .loc to get the matching index + 1.
Consider the following example:
df = pd.DataFrame({0: ['KeyWord', 'foo', 'bar', 'KeyWord', 'blah']})
print(df)
#          0
# 0  KeyWord
# 1      foo
# 2      bar
# 3  KeyWord
# 4     blah
Apply the mask, and get the rows of the index + 1:
key_word_df = df.loc[df[df[0].str.contains("KeyWord")].index + 1, :]
print(key_word_df)
#       0
# 1   foo
# 4  blah
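Note that index + 1 only lines up with "the next row" when the frame has the default RangeIndex. A positional sketch that works for any index (using NumPy's flatnonzero; the bounds check guards against a match in the last row):
import numpy as np

pos = np.flatnonzero(df[0].str.contains("KeyWord")) + 1
key_word_df = df.iloc[pos[pos < len(df)]]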

Replace all occurrences of a string in a pandas dataframe (Python)

I have a pandas dataframe with about 20 columns.
It is possible to replace all occurrences of a string (here a newline) by manually writing all column names:
df['columnname1'] = df['columnname1'].str.replace("\n","<br>")
df['columnname2'] = df['columnname2'].str.replace("\n","<br>")
df['columnname3'] = df['columnname3'].str.replace("\n","<br>")
...
df['columnname20'] = df['columnname20'].str.replace("\n","<br>")
This unfortunately does not work:
df = df.replace("\n","<br>")
Is there any other, more elegant solution?
You can use replace and pass the strings to find/replace as dictionary keys/items:
df.replace({'\n': '<br>'}, regex=True)
For example:
>>> df = pd.DataFrame({'a': ['1\n', '2\n', '3'], 'b': ['4\n', '5', '6\n']})
>>> df
     a    b
0  1\n  4\n
1  2\n    5
2    3  6\n
>>> df.replace({'\n': '<br>'}, regex=True)
       a      b
0  1<br>  4<br>
1  2<br>      5
2      3  6<br>
Note that this method returns a new DataFrame instance by default (it does not modify the original), so you'll need to either reassign the output:
df = df.replace({'\n': '<br>'}, regex=True)
or specify inplace=True:
df.replace({'\n': '<br>'}, regex=True, inplace=True)
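For context on why the question's plain df = df.replace("\n", "<br>") did nothing: without regex=True, DataFrame.replace matches whole cell values, so only a cell that is exactly "\n" would be replaced. Passing regex=True enables substring replacement even without the dict form:
df.replace('\n', '<br>', regex=True)  # equivalent to the dict form above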
It seems pandas has changed its API to avoid ambiguity when handling regexes. Now you should use:
df.replace({'\n': '<br>'}, regex=True)
You can iterate over all columns and use the method str.replace:
for col in df.columns:
    df[col] = df[col].str.replace('\n', '<br>')
Note: in older pandas versions str.replace treated the pattern as a regex by default; since pandas 2.0 the default is regex=False (for a plain '\n' the result is the same either way).
This will remove all newlines and unnecessary whitespace. You can change the ''.join to a different joining string (e.g. '<br>'.join) to specify a replacement:
df['columnname'] = [''.join(c.split()) for c in df['columnname'].astype(str)]
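For example, a quick sketch of what the comprehension does on a small throwaway Series:
s = pd.Series(['hello\n world', 'foo \n bar'])
[''.join(c.split()) for c in s.astype(str)]      # ['helloworld', 'foobar']
['<br>'.join(c.split()) for c in s.astype(str)]  # ['hello<br>world', 'foo<br>bar']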
