Pandas - Splitting text using a delimiter - python

Given below is a view of my DataFrame:
Id,user_id
1,glen-max
2,tom-moody
I am trying to split the values in the user_id column and store them in new columns.
I am able to split the user_id using the below code.
z = z['user_id'].str.split('-', 1, expand=True)
I would like this split column to be part of my original Dataframe.
Given below is the expected format of the DataFrame:
Id,user_id,col1,col2
1,glen-max,glen,max
2,tom-moody,tom,moody
Could anyone help me make it part of the original DataFrame? Thanks.

A general solution, which works even if multiple - characters are possible:
df = z.join(z['user_id'].str.split('-', n=1, expand=True).add_prefix('col'))
print (df)
Id user_id col0 col1
0 1 glen-max glen max
1 2 tom-moody tom moody
If there is always at most one -, you can use:
z[['col1','col2']] = z['user_id'].str.split('-', n=1, expand=True)
print (z)
Id user_id col1 col2
0 1 glen-max glen max
1 2 tom-moody tom moody

Using str.split
Ex:
import pandas as pd
df = pd.read_csv(filename, sep=",")
df[["col1","col2"]] = df['user_id'].str.split('-', 1, expand=True)
print(df)
Output:
Id user_id col1 col2
0 1 glen-max glen max
1 2 tom-moody tom moody
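For reference, a minimal self-contained sketch of the direct-assignment approach, assuming the sample data from the question (n is passed as a keyword, as required by recent pandas):
import pandas as pd

z = pd.DataFrame({'Id': [1, 2], 'user_id': ['glen-max', 'tom-moody']})
# Split on the first '-' and store the two pieces as new columns.
z[['col1', 'col2']] = z['user_id'].str.split('-', n=1, expand=True)
print(z)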

Related

Dropping column if more than half of the values are same - Python

I have a pandas DataFrame which looks like the picture in the original post (image not reproduced here).
I want to delete any column in which more than half of the values are the same, and I don't know how to do this.
I tried using pandas.Series.value_counts, but with no luck.
You can iterate over the columns, count the occurrences of values with value_counts as you tried, and check whether the most common value covers more than 50% of the column's data.
n = len(df)
cols_to_drop = []
for e in list(df.columns):
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value in column e
    if 2 * max_occ > n:  # check if it is more than half the length of the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
You can use apply + value_counts and getting the first value to get the max count:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Thus, simply turn it into a mask depending on whether the greatest count is more than half len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Used input:
df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0]})
Boolean slicing with a comprehension:
df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()  # items() replaces iteritems(), which was removed in pandas 2.0
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to @mozway for the input data.
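A small helper wrapping the same idea (a sketch, not from the original answers; the threshold check mirrors the mask above):
import pandas as pd

def drop_mostly_constant(df: pd.DataFrame) -> pd.DataFrame:
    # Keep a column only if its most frequent value covers at most half the rows.
    keep = df.apply(lambda s: s.value_counts().iat[0] <= len(df) / 2)
    return df.loc[:, keep]

df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0]})
print(drop_mostly_constant(df))  # only col2 survives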

For each distinct value in a given column, count the null and non-null values in another column

Suppose I have the following dataframe:
df = pd.DataFrame({'col1': ['x','y','z','x','x','x','y','z','y','y'],
                   'col2': [np.nan,'n1',np.nan,np.nan,'n3','n2','n5',np.nan,np.nan,np.nan]})
For each distinct element in col1 I want to count how many null and non-null values there are in col2 and summarise the result in a new dataframe. So far I used df1 = df[df['col1']=='x'] and then
print(df1[df1['col2'].isna()].shape[0],
      df1[df1['col2'].notna()].shape[0])
I was then manually changing the value in df1 so that df1 = df[df['col1']=='y'] and df1 = df[df['col1']=='z']. Yet my method is not efficient at all. The table I desire should look like the following:
col1 value no value
0 x 2 2
1 y 2 2
2 z 0 2
I have also tried df.groupby('col1').col2.nunique(), yet that only gives me results for the non-null values.
Let us try crosstab to create a frequency table where the index is the unique values in column col1 and columns represent the corresponding counts of non-nan and nan values in col2:
out = pd.crosstab(df['col1'], df['col2'].isna())
out.columns = ['value', 'no value']
>>> out
value no value
col1
x 2 2
y 2 2
z 0 2
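If col1 should be a regular column with a default integer index, as in the desired output, a small follow-up (not part of the original answer):
out = out.reset_index()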
Use Series.isna with SeriesGroupBy.value_counts for the counts, reshape with Series.unstack, and do some data cleaning:
df = (df['col2'].isna()
        .groupby(df['col1'])
        .value_counts()
        .unstack(fill_value=0)
        .reset_index()
        .rename_axis(None, axis=1)
        .rename(columns={False: 'value', True: 'no value'}))
print (df)
col1 value no value
0 x 2 2
1 y 2 2
2 z 0 2
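Another variant using named aggregation (a sketch, not part of the original answers; the column labels are chosen to match the desired output):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x','y','z','x','x','x','y','z','y','y'],
                   'col2': [np.nan,'n1',np.nan,np.nan,'n3','n2','n5',np.nan,np.nan,np.nan]})
# 'count' returns the number of non-null values per group; the lambda counts the nulls.
out = (df.groupby('col1')['col2']
         .agg(value='count', **{'no value': lambda s: s.isna().sum()})
         .reset_index())
print(out)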

Faster method of extracting characters for multiple columns in dataframe

I have a pandas DataFrame with multiple columns that has string data in a format like this:
id col1 col2 col3
1 '1:correct' '0:incorrect' '1:correct'
2 '0:incorrect' '1:correct' '1:correct'
What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:
id col1 col2 col3
1 1 0 1
2 0 1 1
What I have tried is using regex, like the following:
colname = ['col1', 'col2', 'col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0, row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')
The second loop selects the first and only element in the list created by the regex. I then convert the object dtype to integer. This code basically does what I want, but it is way too slow even for a small dataset with a few thousand rows. I have heard that loops are not very efficient in Python.
Is there a faster, more Pythonic way of extracting numerics in a string and converting it to integers?
Use Series.str.extract to get the first value before : inside DataFrame.apply, processing each column with a lambda function:
colname = ['col1','col2','col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Another solution with str.split, selecting the first value before the colon:
colname = ['col1','col2','col3']
f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
An option is using a list comprehension; since this involves strings, you should get good speed:
import re
pattern = re.compile(r"\d(?=:)")
result = {key: [int(pattern.search(arr).group(0))
                if isinstance(arr, str)
                else arr
                for arr in value.array]
          for key, value in df.items()}
pd.DataFrame(result)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1

Unique UUID base on n columns of Pandas dataframe (to handle duplicates on ElasticSearch)

I am creating a function to set a UUID column based on the values of other columns. What I want is to handle duplicates when indexing DataFrames into Elasticsearch. The UUID should always be the same based on the values of several columns.
I am having problems with the output: the same UUID is generated for each row.
Dataframe
cols = ['col1', 'col2']
data = {'col1': ['Mike', 'Robert', 'Sandy'],
        'col2': ['100', '200', '300']}
df = pd.DataFrame(data)
col1 col2
0 Mike 100
1 Robert 200
2 Sandy 300
Function
def create_uuid_on_n_col(df):
    # concat column string values
    concat_col_str_id = df.apply(lambda x: uuid.uuid5(uuid.NAMESPACE_DNS, '_'.join(map(str, x))), axis=1)
    return concat_col_str_id[0]
Output
df['id'] = create_uuid_on_n_col(df[['col1','col2']])
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 a17ad043-486f-5eeb-8138-8fa2b10659fd
2 Sandy 300 a17ad043-486f-5eeb-8138-8fa2b10659fd
There's no need to define another helper function. We can also vectorize the joining of the columns as shown below.
from functools import partial
p = partial(uuid.uuid5, uuid.NAMESPACE_DNS)
df.assign(id=(df.col1 + '_' + df.col2).apply(p))
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 e520efd5-157a-57ee-84fb-41b9872af407
2 Sandy 300 11208b7c-b99b-5085-ad98-495004e6b043
If you don't want to import partial then define a function.
def custom_uuid(data):
    val = uuid.uuid5(uuid.NAMESPACE_DNS, data)
    return val
df.assign(id=(df.col1 + '_' + df.col2).apply(custom_uuid))
Using your original function as shown below.
def create_uuid_on_n_col(df):
    temp = df.agg('_'.join, axis=1)
    return df.assign(id=temp.apply(custom_uuid))
create_uuid_on_n_col(df[['col1','col2']])
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 e520efd5-157a-57ee-84fb-41b9872af407
2 Sandy 300 11208b7c-b99b-5085-ad98-495004e6b043
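To generalize to an arbitrary list of columns, a sketch along the same lines (the function name and column list are illustrative, not from the original answer):
import uuid
import pandas as pd

def add_uuid_column(df, cols, namespace=uuid.NAMESPACE_DNS):
    # Build one key string per row from the selected columns, then derive a
    # deterministic UUIDv5 from it, so equal rows always get equal ids.
    keys = df[cols].astype(str).agg('_'.join, axis=1)
    return df.assign(id=keys.apply(lambda k: uuid.uuid5(namespace, k)))

df = pd.DataFrame({'col1': ['Mike', 'Robert', 'Sandy'],
                   'col2': ['100', '200', '300']})
print(add_uuid_column(df, ['col1', 'col2']))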

How to combine groupby and sort in pandas

I am trying to get one result per 'name' with all of the latest data, unless the column is blank. In R I would have used group_by, sorted by timestamp and selected the latest value for each column. I tried to do that here and got very confused. Can someone explain how to do this in Python? In the example below my goal is:
col2 date name
1 4 2018-03-27 15:55:29 bil # latest timestamp with the latest non-blank col2 value
Here's my code so far:
d = {'name':['bil','bil','bil'],'date': ['2018-02-27 14:55:29', '2018-03-27 15:55:29', '2018-02-28 19:55:29'], 'col2': [3,'', 4]}
df2 = pd.DataFrame(data=d)
print(df2)
grouped = df2.groupby(['name']).sum().reset_index()
print(grouped)
sortedvals=grouped.sort_values(['date'], ascending=False)
print(sortedvals)
Here's one way:
df3 = df2[df2['col2'] != ''].sort_values('date', ascending=False).drop_duplicates('name')
# col2 date name
# 2 4 2018-02-28 19:55:29 bil
However, the dataframe you provided and output you desire seem to be inconsistent.
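For completeness, a sketch of the R-style approach described in the question (sort by timestamp, group by name, take the latest non-blank value per column); this is an interpretation of the intent, not part of the original answer:
import numpy as np
import pandas as pd

d = {'name': ['bil', 'bil', 'bil'],
     'date': ['2018-02-27 14:55:29', '2018-03-27 15:55:29', '2018-02-28 19:55:29'],
     'col2': [3, '', 4]}
df2 = pd.DataFrame(data=d)

# Treat blanks as missing, sort chronologically, then take the last non-null
# value per column within each name (GroupBy.last skips NaN by default).
latest = (df2.replace('', np.nan)
             .sort_values('date')
             .groupby('name', as_index=False)
             .last())
print(latest)  # col2 = 4 with the latest timestamp, as in the desired output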
