I have the following DF:
import pandas as pd
df = pd.DataFrame({'col1': ["a", "b"],
                   'col2': ["ab", "XX"],
                   'col3': ["w", "e"],
                   'col4': ["foo", "bar"]})
Which looks like this:
In [8]: df
Out[8]:
  col1 col2 col3 col4
0    a   ab    w  foo
1    b   XX    e  bar
What I want to do is combine col2, col3, and col4 into a new column called ID:
  col1 col2 col3 col4        ID
0    a   ab    w  foo  ab.w.foo
1    b   XX    e  bar  XX.e.bar
How can I achieve that?
I tried this but failed:
df["ID"] = df.apply(lambda x: '.'.join(["col2","col3","col4"]),axis=1)
In [10]: df
Out[10]:
  col1 col2 col3 col4              ID
0    a   ab    w  foo  col2.col3.col4
1    b   XX    e  bar  col2.col3.col4
Use x[['col2', 'col3', 'col4']] inside the lambda:
In [54]: df.apply(lambda x: '.'.join(x[['col2', 'col3', 'col4']]),axis=1)
Out[54]:
0 ab.w.foo
1 XX.e.bar
dtype: object
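To create the ID column from that, assign the result back:
df["ID"] = df.apply(lambda x: '.'.join(x[['col2', 'col3', 'col4']]), axis=1)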
There is a small typo in your code: you should use the x that is passed into the lambda function to access the row values:
In [29]: df["ID"] = df.apply(lambda x: '.'.join([x['col2'],x['col3'],x['col4']]),axis=1)
In [30]: df
Out[30]:
  col1 col2 col3 col4        ID
0    a   ab    w  foo  ab.w.foo
1    b   XX    e  bar  XX.e.bar
A slightly simpler approach that also runs faster:
df['id'] = df.col2 + '.' + df.col3 + '.' + df.col4
Illustrative timing with 10000 rows:
>>> t1 = timeit.timeit("df['id'] = df.col2 + '.' + df.col3 +'.' + df.col4", "from __main__ import pd,df", number=100)
Yields 0.00221121072769s per loop
>>> t2 = timeit.timeit("df.apply(lambda x: '.'.join(x[['col2', 'col3', 'col4']]), axis=1)","from __main__ import pd,df", number=100)
Yields 3.32903954983s per loop
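If there are many columns to combine, spelling each one out gets tedious. A more general variant (my sketch, not from the answers above) joins any subset of columns in one call:
# Join an arbitrary list of columns with '.', casting to str first to be safe
cols = ['col2', 'col3', 'col4']
df['ID'] = df[cols].astype(str).agg('.'.join, axis=1)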
I have a dataframe and I am trying to split the col1 string value whenever it contains ":", take the part after the colon, and put it into col2, like this:
df['col1'] = df['col1'].astype(str)
df['col2'] = df['col1'].astype(str)
for i, row in df.iterrows():
    if ":" in row['col1']:
        row['col2'] = row['col1'].split(":")[1] + " " + "in Person"
        row['col1'] = 'j'
It works on a sample dataframe like this, but it doesn't change the result in the original dataframe:
import pandas as pd
d = {'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']}
df = pd.DataFrame(data=d)
print(df)
  col1         col2
0    j  b in Person
1   ac         y 25
What am I doing wrong, and what are the alternatives for this condition?
For the extracting part, try:
df['col2'] = df.col1.str.extract(r':(.+)', expand=False).add(' ').add(df.col2, fill_value='')
# Output
col1 col2
0 a:b b z 26
1 ac y 25
I'm not sure if I understand the replacing correctly, but here is a try:
df.loc[df.col1.str.contains(':'), 'col1'] = 'j'
# Output
col1 col2
0 j b z 26
1 ac y 25
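As for what goes wrong in the original loop: iterrows yields a copy of each row, so assigning to row never modifies the dataframe itself. A vectorized version of both steps, matching the desired output in the question (a sketch using a boolean mask):
# Rows whose col1 contains ':'
mask = df['col1'].str.contains(':')
# Take the part after the colon and append ' in Person'
df.loc[mask, 'col2'] = df.loc[mask, 'col1'].str.split(':').str[1] + ' in Person'
df.loc[mask, 'col1'] = 'j'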
The code below creates multiple empty dataframes named from the report2 list. They are then populated with a filtered existing dataframe called dfsource.
With a nested for loop, I'd like to filter each of these dataframes using a list of values, but the inner loop does not work as shown.
import pandas as pd

report = ['A', 'B', 'C']
suffix = '_US'
report2 = [s + suffix for s in report]
print(report2)  # result: ['A_US', 'B_US', 'C_US']

source = {'COL1': ['A', 'B', 'C'], 'COL2': ['D', 'E', 'F']}
dfsource = pd.DataFrame(source)
print(dfsource)

df_dict = {}
for i in report2:
    df_dict[i] = pd.DataFrame()
    for x in report:
        df_dict[i] = dfsource.query('COL1==x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.

print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
You can reference a Python variable in a query by prefixing it with @:
df_dict[i] = dfsource.query('COL1 == @x')
So the full code looks like this:
import pandas as pd

report = ['A', 'B', 'C']
suffix = '_US'
report2 = [s + suffix for s in report]
print(report2)  # result: ['A_US', 'B_US', 'C_US']

source = {'COL1': ['A', 'B', 'C'], 'COL2': ['D', 'E', 'F']}
dfsource = pd.DataFrame(source)
print(dfsource)

df_dict = {}
for i in report2:
    df_dict[i] = pd.DataFrame()
    for x in report:
        df_dict[i] = dfsource.query('COL1 == @x')

print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
which outputs (the first table comes from print(dfsource)):
  COL1 COL2
0    A    D
1    B    E
2    C    F
  COL1 COL2
2    C    F
  COL1 COL2
2    C    F
  COL1 COL2
2    C    F
However, note that the inner loop overwrites df_dict[i] on every pass, which is why each key ends up holding only the last filter (COL1 == 'C'). I think you actually want to create a new dictionary entry based on the i and x of each list; you can then move the creation of the dataframe into the second for loop and create a new key for each iteration.
import pandas as pd

report = ['A', 'B', 'C']
suffix = '_US'
report2 = [s + suffix for s in report]
print(report2)  # result: ['A_US', 'B_US', 'C_US']

source = {'COL1': ['A', 'B', 'C'], 'COL2': ['D', 'E', 'F']}
dfsource = pd.DataFrame(source)
print(dfsource)

df_dict = {}
for i in report2:
    for x in report:
        new_key = x + i
        df_dict[new_key] = pd.DataFrame()
        df_dict[new_key] = dfsource.query('COL1 == @x')

for item in df_dict.items():
    print(item)
This outputs nine entries, each filtered on whatever x value was passed on that iteration:
('AA_US', COL1 COL2
0 A D)
('BA_US', COL1 COL2
1 B E)
('CA_US', COL1 COL2
2 C F)
('AB_US', COL1 COL2
0 A D)
('BB_US', COL1 COL2
1 B E)
('CB_US', COL1 COL2
2 C F)
('AC_US', COL1 COL2
0 A D)
('BC_US', COL1 COL2
1 B E)
('CC_US', COL1 COL2
2 C F)
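If the goal is simply one dataframe per COL1 value, groupby can also build the whole dictionary in one pass (my sketch, assuming the report values correspond to the values in COL1):
# groupby yields (value, sub-dataframe) pairs; key them by value + suffix
df_dict = {key + suffix: grp for key, grp in dfsource.groupby('COL1')}
print(df_dict['A_US'])
#   COL1 COL2
# 0    A    D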
My table looks like the following:
import pandas as pd
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
print(df)
"""
col1
0 a>b>c
"""
and my desired output need to be like this:
d1 = {'col1': ['a>b>c'],'col11': ['a'],'col12': ['b'],'col13': ['c']}
d1 = pd.DataFrame(data=d1)
print(d1)
"""
col1 col11 col12 col13
0 a>b>c a b c
"""
I know I need the .split('>') method, but I don't know how to go on from there. Any help?
You can simply split using str.split('>') and expand the result into a dataframe:
import pandas as pd

d = {'col1': ['a>b>c'], 'col2': ['a>b>c']}
df = pd.DataFrame(data=d)
print(df)

col = 'col1'
#temp = df[col].str.split('>', expand=True).add_prefix(col)
temp = df[col].str.split('>', expand=True).rename(columns=lambda x: col + str(int(x) + 1))
temp.merge(df, left_index=True, right_index=True, how='outer')
Out:
  col11 col12 col13   col1   col2
0     a     b     c  a>b>c  a>b>c
In case you want to do it for multiple columns, you can also use:
for col in df.columns:
    temp = df[col].str.split('>', expand=True).rename(columns=lambda x: col + str(int(x) + 1))
    df = temp.merge(df, left_index=True, right_index=True, how='outer')
Out:
  col21 col22 col23 col11 col12 col13   col1   col2
0     a     b     c     a     b     c  a>b>c  a>b>c
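The same loop can also be written as a single concat (my rephrasing; note the column order will differ from the loop version):
# Split every column, then glue all the parts and the originals side by side
parts = [df[c].str.split('>', expand=True).rename(columns=lambda x: c + str(int(x) + 1))
         for c in df.columns]
df = pd.concat(parts + [df], axis=1)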
Using split:
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
df = pd.concat([df, df.col1.str.split('>', expand=True)], axis=1)
df.columns = ['col1', 'col11', 'col12', 'col13']
df
Output:
    col1 col11 col12 col13
0  a>b>c     a     b     c
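If you already know how many parts each value has, you can also assign the expanded split straight into new columns (a compact variant, assuming exactly three parts per value):
df[['col11', 'col12', 'col13']] = df['col1'].str.split('>', expand=True)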
I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3       ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0]*203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure:
    for index, row in df_temp.iterrows():
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
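One caveat (my addition, not part of the original answer): ''.join raises a TypeError if a column contains NaN or other non-strings, so a defensive version might drop missing values and cast first:
{c: len(set(''.join(df[c].dropna().astype(str)))) for c in df.columns}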
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
One more option, though note that it sums per-cell unique counts, so it only matches the column-wide count when no symbol appears in more than one cell (which happens to hold for this example):
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64
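Note that on pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, so the same idea would be written as:
df.map(lambda x: len(set(x))).sum()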
I am looking to find the unique values for each column in my dataframe. (Values unique for the whole dataframe)
  Col1 Col2 Col3
1    A    A    B
2    C    A    B
3    B    B    F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas? Thank you!
You can use stack to get a Series, then drop_duplicates with keep=False to remove all duplicated values, then remove the first index level with reset_index and finally reindex by the original columns:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print(df)
Col1      C
Col2    NaN
Col3      F
dtype: object
The solution above works nicely if there is only one unique value per column. Here is an attempt at a more general solution:
print(df)
  Col1 Col2 Col3
1    A    A    B
2    C    A    X
3    B    B    F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print(s)
Col1    C
Col3    X
Col3    F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print(s)
Col1       [C]
Col2       NaN
Col3    [X, F]
dtype: object
I don't believe this is exactly what you want, but as useful information: you can find the unique values of a whole DataFrame using numpy's .unique(), like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
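A related way to answer the original question (my sketch): count every value across the whole frame with value_counts, keep the values seen exactly once, and intersect with each column:
# Values that occur exactly once anywhere in the frame
counts = df.stack().value_counts()
singletons = set(counts[counts == 1].index)
# Per column, the values unique to the whole dataframe
result = {col: sorted(set(df[col]) & singletons) for col in df.columns}
print(result)  # {'Col1': ['C'], 'Col2': [], 'Col3': ['F']}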