How to split multiple values from a dataframe column into separate columns - python

I have a column with multiple comma-separated values per row. I want to split the unique values into multiple columns with headers and then apply LabelEncoder or OneHotEncoder (I don't know which yet), because I have a multi-label text classification problem to solve.
I tried
df['labels1'] = df['labels1'].str.split(',', expand=True)
but it keeps only the first item. Before trying to split the column I also tried to change its type, but that didn't work either.
id
0 Politics, Journals, International
1 Social, Blogs, Celebrities
2 Media, Blogs, Video
3 Food&Drink, Cooking
4 Media, Blogs, Video
5 Culture
6 Social, TV Shows
7 News, Crime, National
8 Social, Blogs, Celebrities
9 Social, Blogs, Celebrities
10 Social, Blogs, Celebrities
11 Family, Blogs
12 Media, Blogs, Video
13 Social, TV Shows
14 Entertainment, TV Shows
15 Social, TV Shows
16 Social, Blogs, Celebrities

It seems like the right-hand side, df['labels1'].str.split(',', expand=True), spits out up to three columns for this data. So perhaps you can do something like:
df[['newcolumn1', 'newcolumn2', 'newcolumn3']] = df['labels1'].str.split(',', expand=True)

You are trying to set a single column of a dataframe with a three-column dataframe - which unfortunately is done silently by keeping only the first column...
Perhaps concatenate the three new expanded columns to the original dataframe instead:
df = pd.concat([df, df['labels1'].str.split(', ', expand=True)], axis=1)
or just continue this step in a new dataframe:
df_exp = df['labels1'].str.split(', ', expand=True)
Edit:
IIUC, your binary table can be created like this (though I don't know if this is the recommended way to do it):
col_head = set(df.labels1.str.split(', ', expand=True).values.flatten())
col_head.discard(None)  # expand=True pads rows with fewer labels with None
bin_tbl = pd.DataFrame(index=df.index, columns=sorted(col_head))
for c in bin_tbl:
    bin_tbl[c] = df.labels1.str.split(', ').apply(lambda x: c in x)
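Note that a shorter route to the same binary table, assuming the labels are always separated by ', ', is str.get_dummies (not part of the original answer, just an alternative sketch):
# one 0/1 column per unique label, ready for a multi-label classifier
bin_tbl = df['labels1'].str.get_dummies(sep=', ')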

Related

Splitting column of a really big dataframe in two (or more) new cols

Problem
Hey there! I'm having some trouble trying to split one column of my dataframe into two (or even more) new columns. I think this is because the dataframe I'm working with comes from a really big CSV file, almost 10 GB in size. Once loaded into a Pandas dataframe, it has roughly 60 million rows and 5 columns.
Example
Initially, the dataframe looks something like this:
In [1]: df
Out[1]:
category other_col
0 animal.cat 5
1 animal.dog 3
2 clothes.shirt.sports 6
3 shoes.laces 1
4 None 0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column into three new columns based on where the dot appears: one for the main category, one for the first subcategory and another one for the last subcategory (if that actually exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
other_col main_cat sub_category_1 sub_category_2
0 5 animal cat None
1 3 animal dog None
2 6 clothes shirt sports
3 1 shoes laces None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine on small datasets, but it uses too much memory when applied to a dataframe that big and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how I should proceed after that; I've obviously tried searching for this kind of problem here on Stack Overflow, but I didn't manage to find anything useful.
Any help is appreciated!
The split and join methods do the job:
results = df['category'].str.split('.', expand=True)
df_after = df.join(results)
After doing that you can freely filter the resulting dataframe.
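If memory is still the bottleneck, one way to use the chunked reading the asker mentions is to split each chunk and append the result to disk. This is only a sketch under assumptions: the source is a CSV at 'big_file.csv' (hypothetical path) and categories have at most three dot-separated levels, as in the example.
import pandas as pd

out_path = 'big_file_split.csv'   # hypothetical output path
for i, chunk in enumerate(pd.read_csv('big_file.csv', chunksize=1_000_000)):
    chunk = chunk[chunk['category'].notnull()]
    cats = chunk['category'].str.split('.', expand=True)
    # force exactly three columns so every chunk appends with the same layout
    cats = cats.reindex(columns=range(3))
    cats.columns = ['main_cat', 'sub_category_1', 'sub_category_2']
    result = chunk.drop(columns=['category']).join(cats)
    # write the header only for the first chunk, then append
    result.to_csv(out_path, mode='w' if i == 0 else 'a', header=(i == 0), index=False)
This keeps only one chunk in memory at a time; the split file can then be re-read or filtered as needed.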

Pandas - Find String and return adjacent values for the matched data

I'm struggling to write a piece of code to solve the problem below.
I have two Excel spreadsheets. Let's take as an example:
DF1 - 1. Master Data
DF2 - 2. Consumer Details.
I need to iterate over the Description column in Consumer Details, check whether it contains a string or substring that appears in the Master Data sheet, and return the adjacent value. I understand it's pretty straightforward and simple, but I've been unable to succeed.
I was using Index Match in Excel -
INDEX('Path\[Master Sheet.xlsx]Master List'!$B$2:$B$199, MATCH(TRUE, ISNUMBER(SEARCH('path\[Master Sheet.xlsx]Master List'!$A$2:$A$199, B3)), 0))
But need a solution in Python/Pandas -
Eg Df1 - Master Sheet -
Store Category
Nike Shoes
GAP Clothing
Addidas Shoes
Apple Electronics
Abercrombie Clothing
Hollister Clothing
Samsung Electronics
Netflix Movies
etc.....
df2 - Consumer Sheet-
Date Description Amount Category
01/01/20 GAP Stores 1.1
01/01/20 Apple Limited 1000
01/01/20 Aber fajdfal 50
01/01/20 hollister das 20
01/01/20 NETFLIX.COM 10
01/01/20 GAP Kids 5.6
Now, I need to update the Category column in the consumer sheet based on the Description (string/substring) column in the consumer sheet, referring to the Store column in the master sheet.
Any inputs/suggestions highly appreciated.
One option is to make a custom function that loops through the df1 values in order to match a store to a string provided as an argument. If a match is found it returns the associated category string, and if none is found it returns None or some other default value. You can use str.lower to increase the chances of a match being found. You then use pandas.Series.apply to apply this function to the column you want to find matches in.
import pandas as pd

df1 = pd.DataFrame(dict(
    Store = ['Nike','GAP','Addidas','Apple','Abercrombie'],
    Category = ['Shoes','Clothing','Shoes','Electronics','Clothing'],
))
df2 = pd.DataFrame(dict(
    Date = ['01/01/20','01/01/20','01/01/20'],
    Description = ['GAP Stores','Apple Limited','Aber fajdfal'],
    Amount = [1.1,1000,50],
))

def get_cat(x):
    for store, cat in df1[['Store','Category']].values:
        if store.lower() in x.lower():
            return cat

df2['Category'] = df2['Description'].apply(get_cat)
print(df2)
Output:
Date Description Amount Category
0 01/01/20 GAP Stores 1.1 Clothing
1 01/01/20 Apple Limited 1000.0 Electronics
2 01/01/20 Aber fajdfal 50.0 None
I should note that if 'Aber fajdfal' is supposed to match to 'Abercrombie' then this solution is not going to work. You'll need to add more complex logic to the function in order to match partial strings like that.
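If abbreviated descriptions such as 'Aber fajdfal' do need to resolve to 'Abercrombie', one possible extension (only a sketch; the prefix heuristic is my assumption, not part of the original answer) is to also accept a prefix match on the first word of the description:
def get_cat_prefix(x):
    # hedged variant: also accept a prefix match between the first word of the
    # description and a store name (so 'Aber' matches 'Abercrombie')
    first = x.split()[0].lower()
    for store, cat in df1[['Store', 'Category']].values:
        s = store.lower()
        if s in x.lower() or s.startswith(first) or first.startswith(s):
            return cat

df2['Category'] = df2['Description'].apply(get_cat_prefix)
Short first words can of course produce false positives, so the heuristic may need tightening (e.g. a minimum prefix length).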

Removing duplicates with a condition in data frame

I have a data frame with text in one column and its label in another column.
Some texts are duplicated, each copy carrying a different single label.
I want to remove these duplicates and keep only the record with the specified label.
Sample dataframe:
text label
0 great view a
1 great view b
2 good balcony a
3 nice service a
4 nice service b
5 nice service c
6 bad rooms f
7 nice restaurant a
8 nice restaurant d
9 nice beach nearby x
10 good casino z
Now I want to keep the text wherever label a is present and remove only the duplicates.
Sample output:
text label
0 great view a
1 good balcony a
2 nice service a
3 bad rooms f
4 nice restaurant a
5 nice beach nearby x
6 good casino z
Thanks in advance!
You can simply try sort_values before drop_duplicates, since the df will first be ordered by label in alphabetical order (so 'a' sorts before 'b') and drop_duplicates keeps the first occurrence:
df=df.sort_values('label').drop_duplicates('text')
Or
df=df.sort_values('label').groupby('text').head(1)
Update: to keep an arbitrary label rather than the alphabetically first one, sort the rows so that the preferred label comes first:
Valuetokeep = 'a'
df = df.iloc[(df.label != Valuetokeep).argsort()].drop_duplicates('text')
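If more than one label needs to be preferred (say 'a' over 'b' over everything else), a hedged sketch using a priority mapping (the priority order here is made up for illustration):
priority = {'a': 0, 'b': 1}   # hypothetical preference order; everything else gets 2
df = (df.assign(prio=df['label'].map(priority).fillna(2))
        .sort_values('prio', kind='mergesort')   # mergesort keeps the original row order within ties
        .drop_duplicates('text')
        .drop(columns='prio'))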

Cleaning survey data in Python - how to find and clean common rows in two files?

I am working on a survey data analysis project which consists of 2 Excel files: the pre-survey file contains 800+ response records, while the post-survey file contains around 500 responses. Both of them have (at least) one common column, SID (Student ID). Something Y happened in between, and I am interested in analysing the effectiveness of Y and to what degree Y impacts different categories of people.
What adds more complexity is that each Excel file contains multiple tabs. Different interviewers interviewed several interviewees and documented different sections of the survey in separate tabs. Columns may or may not be the same across tabs, so it would be hard to compile them into one file. (Or does it actually make sense to combine them into one with lots of null values?)
I am trying to find the students who did both the pre- and post-surveys. How can I do this across sheets and files using Python/pandas/other packages?
Bonus if you could also suggest an approach to solve the problem.
If I'm understanding this correctly, your data is currently formatted like this:
survey1.xlsx
Sheet 1 (interviewer a)
STU-ID QUESTION 1 RESPONSE 1 QUESTION 2 RESPONSE 2
00001 tutoring? True lunch a? False
survey1.xlsx
Sheet 2 (interviewer b)
STU-ID QUESTION 1 RESPONSE 1 QUESTION 2 RESPONSE 2
00004 tutoring? True lunch a? TRUE
survey2.xlsx
Sheet 1
STU-ID QUESTION 1 RESPONSE 1 Tutorer GPA
00001 improvement? True Jim 3.5
survey2.xlsx
Sheet 2 (interviewer b)
STU-ID QUESTION 1 RESPONSE 1 Tutorer GPA
00004 improvement? yes Sally 2.8
If that's the case, and without knowing the data well, I would combine the tabs so that the pre-survey has one row per unique student ID (I'm not sure if the same student was interviewed by multiple surveyors; if they were, you may need to do a group by, but that sounds messy).
Then I would do the same for the post-survey responses and join them into a single dataframe. From that df create a new df with only the responses you care about (this could get rid of some NA answers).
Do a df.describe() and a df.dtypes to inspect the result.
Transform the data so that answers such as "yes/no" become booleans, or at least so they're all in the same format, and do the same for numerical responses (int64 or float64) - see the small sketch further down.
Finally, I would dropna, so that the df only contains students with responses from both the first survey and the second survey.
Side note: with only 800 responses, it may be easier to do this just in Excel. If you aren't comfortable with Python it could take you several hours to accomplish this, whereas in Excel it could take you 20 minutes.
If your goal is to learn Python, then go for it.
Python
import pandas as pd

# cols should be the list of the columns you actually need from each sheet
df_s1s1 = pd.read_excel('survey1.xlsx', na_values="Missing", sheet_name='sheet 1', usecols=cols)
df_s1s1.head()
df_s1s2 = pd.read_excel('survey1.xlsx', na_values="Missing", sheet_name='sheet 2', usecols=cols)
df_s1s2.head()
and then the same for the second survey file
df_s2s1 = pd.read_excel('survey2.xlsx', na_values="Missing", sheet_name='sheet 1', usecols=cols)
df_s2s1.head()
df_s2s2 = pd.read_excel('survey2.xlsx', na_values="Missing", sheet_name='sheet 2', usecols=cols)
df_s2s2.head()
To add the different sheets to the same dataframe as rows, you would use something like this:
df_survey_1 = pd.concat([df_s1s1, df_s1s2])
df_survey_1.head()
Then the same for the second survey:
df_survey_2 = pd.concat([df_s2s1, df_s2s2])
df_survey_2.head()
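Following the earlier step about turning "yes/no" answers into booleans, a minimal sketch (the column name 'RESPONSE 1' is taken from the example tables above and the mapping is only illustrative):
# normalise mixed yes/no and True/False answers in the post-survey frame
df_survey_2['RESPONSE 1'] = df_survey_2['RESPONSE 1'].replace({'yes': True, 'no': False})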
And then to create the larger dataframe with all of the columns, you would use something like this:
master_df = pd.merge(df_survey_1, df_survey_2, on='STU-ID')
Drop NA
master_df = master_df.dropna(axis=0, how='any')
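If the only goal is to find the students who appear in both surveys (the original question), a minimal sketch assuming the STU-ID column from the examples above:
# students present in both the pre- and post-survey frames
common_ids = set(df_survey_1['STU-ID']) & set(df_survey_2['STU-ID'])
both = df_survey_1[df_survey_1['STU-ID'].isin(common_ids)]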
hope this helps

Count occurrence of elements in column of lists (with a twist)

I've got a column of lists called "author_background" which I would like to analyze. The actual column consists of 8,000 rows. My aim is to get an overview of how many different elements there are in total (across all lists in the column) and to count how many lists each element occurs in.
What my column looks like:
df.author_background
0 [Professor for Business Administration, Harvard Business School]
1 [Professor for Industrial Engineering, University of Oakland]
2 [Harvard Business School]
3 [CEO, SpaceX]
desired output
0 Harvard Business School 2
1 Professor for Business Administration 1
2 Professor for Industrial Engineering 1
3 CEO 1
4 University of Oakland 1
5 SpaceX 1
I would like to know how often "Professor for Business Administration", "Professor for Industrial Engineering", "Harvard Business School", etc. occur in the column. There are many more titles I don't know about.
Basically, I would like to use pd.value_counts on the column. However, that's not possible because each entry is a list.
Is there another way to count the occurrences of each element?
If that's more helpful: I also have a flat list which contains all elements of the lists (not nested).
Turn it all into a single series by list flattening:
pd.Series([bg for bgs in df.author_background for bg in bgs])
Now you can call value_counts() to get your result.
You can try this:
el = pd.Series([item for sublist in df.author_background for item in sublist])
df = el.groupby(el).size().rename_axis('author_background').reset_index(name='counter')
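Both answers flatten the lists manually; on pandas 0.25+ Series.explode can do the flattening for you (just an alternative sketch, not from the original answers):
# same element counts as the flattened-Series approach above
counts = df['author_background'].explode().value_counts()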
