Split a column of a dataframe into two separate columns - python

I'd like to split a column of a dataframe into two separate columns. Here is what my dataframe looks like (only the first 3 rows):
I'd like to split the column referenced_tweets into two columns, type and id, so that, for example, in the first row the value of the type column would be replied_to and the value of id would be 1253050942716551168.
Here is what I've tried:
df[['type', 'id']] = df['referenced_tweets'].str.split(',', n=1, expand=True)
but I get the error:
ValueError: Columns must be the same length as key
(I think I get this error because the type in the referenced_tweets column is NOT always replied_to; e.g., it can be retweeted, and therefore the lengths would be different.)

Why not get the values from the dict and add them to two new columns?
def unpack_column(df_series, key):
    """Unpack the value stored under `key` in the first dict of each cell, skipping NaN values."""
    # A non-empty list is treated as data; anything else (e.g. NaN) becomes None
    return [value[0][key] if isinstance(value, list) and value else None for value in df_series]
df['type'] = unpack_column(df['referenced_tweets'], 'type')
df['id'] = unpack_column(df['referenced_tweets'], 'id')
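or in a one-liner (assuming the column has no missing values; returning a pd.Series makes apply produce two columns):
df[['type', 'id']] = df['referenced_tweets'].apply(lambda x: pd.Series([x[0]['type'], x[0]['id']]))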
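If some rows have no referenced tweet at all, building a small DataFrame from the first dict of each cell may be easier to reason about. The following is only a sketch: it assumes each cell holds a Python list of dicts (not a string), the second id is made up for illustration, and first_ref is a hypothetical helper name.

```python
import pandas as pd

# Sample data; the second id is made up for illustration
df = pd.DataFrame({
    'referenced_tweets': [
        [{'type': 'replied_to', 'id': '1253050942716551168'}],
        float('nan'),
        [{'type': 'retweeted', 'id': '1111111111111111111'}],
    ]
})

def first_ref(value):
    """Hypothetical helper: return the first dict of a cell, or {} if the cell is empty/NaN."""
    if isinstance(value, list) and value:
        return value[0]
    return {}

# One row per cell; missing keys become NaN automatically
refs = pd.DataFrame([first_ref(v) for v in df['referenced_tweets']], index=df.index)
df[['type', 'id']] = refs[['type', 'id']]
```

Rows without a referenced tweet end up with NaN in both new columns instead of raising an error.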

Related

How to check pandas column names and then append to row data efficiently?

I have a dataframe with several columns, some of which have names that match the keys in a dictionary. I want to append the value of the matching dictionary item to the non-null values of the column whose name matches the key. Hopefully that isn't too confusing.
example:
realms = {}
realms['email'] = '<email>'
realms['android'] = '<androidID>'
df = pd.DataFrame()
df['email'] = ['foo#gmail.com', '', 'foo#yahoo.com']
df['android'] = [1234567, None, 55533321]
How could I append '<email>' to 'foo#gmail.com' and 'foo#yahoo.com' without also appending it to the empty string or null value?
I'm trying to do this without using iteritems(), as I have about 200,000 records to apply this logic to.
The expected output would be like 'foo#gmail.com<email>', '', 'foo#yahoo.com<email>'.
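# Note: astype(str) also converts empty strings and NaN, so the suffix is appended to those as well
for column in df.columns:
    df[column] = df[column].astype(str) + realms[column]
>>> df
                  email              android
0  foo#gmail.com<email>   1234567<androidID>
1  foo#yahoo.com<email>  55533321<androidID>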
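To leave empty strings and nulls untouched, one option is to build a boolean mask first and only replace the rows where it is True. This is a sketch under assumptions: the android column is stored with object dtype so the integers are not turned into floats by the missing value, and only columns that have a key in realms are modified.

```python
import pandas as pd

realms = {'email': '<email>', 'android': '<androidID>'}
df = pd.DataFrame({
    'email': ['foo#gmail.com', '', 'foo#yahoo.com'],
    # object dtype keeps the integers as integers despite the missing value
    'android': pd.Series([1234567, None, 55533321], dtype=object),
})

for column, suffix in realms.items():
    if column not in df.columns:
        continue
    values = df[column].astype(object)
    # True only for rows that are neither null nor an empty string
    mask = values.notna() & values.astype(str).ne('')
    # Keep the original value where mask is False, append the suffix where it is True
    df[column] = values.where(~mask, values.astype(str) + suffix)
```

The loop runs once per column rather than once per row, so it stays vectorised over the 200,000 records.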

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame( columns = ['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst = ['adam', 'beth']
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# Note: under certain conditions ffill() gives you wrong values
Explanation:
lst = ['adam', 'beth']
# Create a list of the cleaned names
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
# For each word in the list, check whether the 'Name' column contains it (ignoring case). This gives a boolean Series, which map() turns into the word itself where True and NaN where False. Concatenating these Series on axis=1 produces a DataFrame with one column per word.
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Backward fill along axis=1 and take the first column, i.e. the first matching word in each row
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# Forward fill the rows that matched no word (e.g. 'beht') with the value from the row above
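Since the question mentions fuzzy matching, here is a sketch of that alternative using difflib from the standard library, which avoids relying on row order the way ffill() does. The helper name closest_name and the cutoff of 0.6 (difflib's default) are assumptions for illustration, not part of the answer above.

```python
import difflib
import pandas as pd

df = pd.DataFrame({'Name': ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.',
                            'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']})
ref = pd.DataFrame({'Cleaned Names': ['adam', 'beth']})
candidates = ref['Cleaned Names'].tolist()

def closest_name(name, cutoff=0.6):
    """Hypothetical helper: return the most similar cleaned name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

df['Name Corrected'] = df['Name'].map(closest_name)
```

Each name is compared against every reference name, so this is fine for a short reference table but may be slow for a large one.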

str.split error columns must be same length as key, multiple columns

I have dataframe df_ip_month and am trying to split the column detail EOBs into multiple columns on the same row for each Account such that this
becomes this
The code I am trying to use is
df_ip_month[['eob1','eob2','eob3','eob4','eob5','eob6']] = df_ip_month['detail EOBs'].str.split(expand=True)
However, no output is generated, only the following error
ValueError: Columns must be same length as key
How come?
Here is the dataset
df_ip_month = pd.DataFrame({
    'Account': ['H5000570011700','H5000484349900','H5000500029400','H5000502860000','H5000631774400','H5000619680500',
                'H5000587425100','H5000630746300','H5000632095500','H5000505467800','H5000558994900','H5000623617700',
                'H5000539983300','H5000559033600','H5000570061901','H5000513787300','H5000562451100','H5000568554900'],
    'detail EOBs': ['','','','','','5002','','5002 1442','','5003','','','','5002 3035 9932 3343 3021 2312','','','5003 3035 9932','']})
df = df_ip_month.set_index('Account')
df2 = df['detail EOBs'].str.split(expand=True)
df.join(df2)
You can set your index on your Account column, split and expand your detail EOBs column, then join them back together.
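As for why the error appears: str.split(expand=True) returns as many columns as the longest split produces, and assigning that to a fixed list of column names fails whenever the two counts differ. A sketch that names the columns after the split instead of before it (the account values and the eobN naming here are illustrative assumptions):

```python
import pandas as pd

# Small illustrative frame; account values are made up
df_ip_month = pd.DataFrame({
    'Account': ['A1', 'A2', 'A3'],
    'detail EOBs': ['', '5002 1442', '5003 3035 9932'],
})

# Split on whitespace; the number of resulting columns depends on the data
parts = df_ip_month['detail EOBs'].str.split(expand=True)
# Name the columns based on how many were actually produced
parts.columns = [f'eob{i + 1}' for i in range(parts.shape[1])]
result = df_ip_month.join(parts)
```

Because the names are generated from parts.shape[1], the code keeps working if the longest entry has more or fewer than six values.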

Select columns based on != condition

I have a dataframe and I have a list of some column names that correspond to the dataframe. How do I filter the dataframe so that it != the list of column names, i.e. I want the dataframe columns that are outside the specified list.
I tried the following:
quant_vair = X != true_binary_cols
but get the output error of: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This should help:
df.drop(columns = ["col1", "col2"])
You can either drop the columns from the dataframe, or build a list of the columns that are not in your list and select those:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
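A third variant, sketched here with made-up column names, uses a boolean mask over the column index. It keeps the original column order and does not raise if the list contains a name that is not in the dataframe.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [0, 1], 'c': [3.5, 4.5], 'd': [1, 0]})
true_binary_cols = ['b', 'd', 'not_in_df']

# isin() marks the columns to exclude; ~ inverts the mask
df_filtered = df.loc[:, ~df.columns.isin(true_binary_cols)]
```

By contrast, df.drop(columns=...) raises a KeyError for unknown names unless errors='ignore' is passed.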

Adding columns dynamically to a pandas dataframe, from a list contained in the dataframe

I have a dataframe in which the first column contains a list of random size, from 0 to around 10 items in each list. This dataframe also contains several other columns of data.
I would like to insert as many columns as the length of the longest list, and then populate the values across sequentially such that each column has one item from the list in column one.
I was unsure of a good way to go about this.
sample = [[[0,2,3,7,8,9],2,3,4,5],[[1,2],2,3,4,5],[[1,3,4,5,6,7,8,9,0],2,3,4,5]]
headers = ["col1","col2","col3","col4","col5"]
df = pd.DataFrame(sample, columns = headers)
In this example I would like to add 9 columns after column 1, as this is the maximum length of a list (the one in the third row of the dataframe). These columns would be populated with:
0 2 3 7 8 9 NULL NULL NULL in the first row,
1 2 NULL NULL NULL NULL NULL NULL NULL in the second, etc...
Edited to fit the OP's edit:
This is how I would do it. First I would pad the lists of the original column so that they're all the same length and it's easier to work with them. Afterwards it's a matter of creating the columns and filling it with the value corresponding to the position in the list. Let's say our lists are of size up to 4 for an easier example:
import numpy as np

df = pd.DataFrame(sample, columns=headers)
df = df.rename(columns={'col1': 'col_of_lists'})
# Length of the longest list
max_length = df['col_of_lists'].apply(len).max()
# Pad shorter lists with NaN so every list has max_length items
df['col_of_lists'] = df['col_of_lists'].apply(lambda x: x + [np.nan] * (max_length - len(x)))
# Create one new column per list position
for i in range(max_length):
    df['col_' + str(i)] = df['col_of_lists'].apply(lambda x: x[i])
The easiest way to turn a series of lists into separate columns is to use apply to convert them into a Series, which triggers the 'expand' result type:
result = df['col1'].apply(pd.Series)
At this point, we can adjust the columns from the automatically numbered to include the name of the original 'col1', for example:
result.columns = [
    'col1_{}'.format(i + 1)
    for i in result.columns]
Finally, we can join it back to the original DataFrame. Using the fact that this was the first column makes it easy, just joining it to the left of the original frame, dropping the original 'col1' in the process:
result = result.join(df.drop('col1', axis=1))
You can even do it all as a one-liner, by using the rename() method to change column names:
df['col1'].apply(pd.Series).rename(
    lambda i: 'col1_{}'.format(i + 1),
    axis=1,
).join(df.drop('col1', axis=1))
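Another way to get the same layout, sketched below, is to pass the lists straight to the DataFrame constructor, which pads shorter lists with missing values on its own. It is often faster than apply(pd.Series) on larger frames, though that is worth measuring on your own data.

```python
import pandas as pd

sample = [[[0, 2, 3, 7, 8, 9], 2, 3, 4, 5],
          [[1, 2], 2, 3, 4, 5],
          [[1, 3, 4, 5, 6, 7, 8, 9, 0], 2, 3, 4, 5]]
headers = ["col1", "col2", "col3", "col4", "col5"]
df = pd.DataFrame(sample, columns=headers)

# Build one column per list position; shorter lists are padded with NaN
expanded = pd.DataFrame(df['col1'].tolist(), index=df.index)
expanded.columns = [f'col1_{i + 1}' for i in expanded.columns]

# Put the expanded columns first, followed by the remaining original columns
result = expanded.join(df.drop(columns='col1'))
```

Note that the padded columns become floats, because NaN cannot be stored in a plain integer column.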
