Excel, How to split cells by comma delimiter into new cells - python

So let's say I have data like this, with some delimiter such as commas, that I want to split into new cells, either across into columns or down into rows.
The Data                        Location
One Museum, Two Museum          City A
3rd Park, 4th Park, 5th Park    City B
How would you do it in either direction? There are lots of methods, so why is the method you provide preferred?
Looking for methods in:
Python
Excel
Power Query
R

The Excel manual method: click Data > Text to Columns. Then just copy and paste if you want the data in one column. This is only good when the data set is small and you are doing it once.
The Power Query method: with this method you set it up once for the data source, then just click the Refresh button when the data changes in the future. The data source can be almost anything: a csv file, a website, etc. Steps below:
1 - Pick your data source
2 - When within Excel, choose From Table/Range
3 - Now choose the split method; there is delimiter and 6 other choices.
4 - For this data I went with custom and used ", "
5 & 6 - To split down into rows you have to open Advanced options and make that selection.
7 - Close & Load
This is a good method because you don't have to write any code in Power Query unless you want to.

The Python method
Make sure you have installed pandas with pip, or use conda to install pandas.
The code is like so:
import pandas as pd

df = pd.read_excel('path/to/myexcelfile.xlsx')
# split on ", " so the new cells don't carry a leading space
df[['key.0', 'key.1', 'key.2']] = df['The Data'].str.split(', ', expand=True)
df.drop(columns=['The Data'], inplace=True)
# stop here if you want the data to be split into new columns
The data looks like this:
  Location       key.0       key.1     key.2
0   City A  One Museum  Two Museum      None
1   City B    3rd Park    4th Park  5th Park
To get the split into rows proceed with the next code part:
stacked = df.set_index('Location').stack()
# set the name of the new series created
df = stacked.reset_index(name='The Data')
# drop the 'source' level (key.*)
df.drop('level_1', axis=1, inplace=True)
Now this is done and it looks like this:
  Location    The Data
0   City A  One Museum
1   City A  Two Museum
2   City B    3rd Park
3   City B    4th Park
4   City B    5th Park
The benefit of Python is that it is faster for larger data sets, and you can split using regex in probably a hundred different ways. The data source can be all the types you would use for Power Query and more.
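Another Python sketch, assuming pandas ≥ 0.25 for DataFrame.explode and rebuilding the example table in memory: str.split without expand=True gives lists, and explode turns each list element into its own row, which skips naming the intermediate key columns entirely.

```python
import pandas as pd

# rebuild the example table in memory (assumed to match the spreadsheet)
df = pd.DataFrame({
    'The Data': ['One Museum, Two Museum', '3rd Park, 4th Park, 5th Park'],
    'Location': ['City A', 'City B'],
})

# split each cell into a list, then emit one list element per row
df['The Data'] = df['The Data'].str.split(', ')
long_df = df.explode('The Data').reset_index(drop=True)
print(long_df)
```

This lands directly on the split-into-rows result without the stack/reset_index/drop round trip.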

R
library(data.table)
dt <- fread("yourfile.csv") # or use readxl package for xls files
dt
# Data Location
# 1: One Museum, Two Museum City A
# 2: 3rd Park, 4th Park, 5th Park City B
dt[, .(Data = unlist(strsplit(Data, ", "))), by = Location]
# Location Data
# 1: City A One Museum
# 2: City A Two Museum
# 3: City B 3rd Park
# 4: City B 4th Park
# 5: City B 5th Park

Related

how to spot delete spaces in pandas column

I have a dataframe with a column location which looks like this:
On the screenshot you see the case with 5 spaces in the location column, but there are a lot more cells with 3 and 4 spaces, while the most normal case is just two spaces: between the city and the state, and between the state and the post code.
I need to perform str.split() on the location column, but due to the different number of spaces it will not work, because if I substitute spaces with an empty string or commas, I'll get a different number of potential splits.
So I need to find a way to turn the spaces that are inside city names into hyphens, so that I am able to do the split later, but at the same time not touch the other spaces (between city and state, and between state and post code). Any ideas?
I have written this code in terms of easy understanding/readability. One way to solve the above query is to split the location column first into city & state, perform the operation on city, & merge it back with state.
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})
df[['city', 'state']] = df['location'].str.split(',', expand=True)
df['city'] = df['city'].str.replace(' ', '_')
df['location_new'] = df['city'] + ',' + df['state']
df.head()
The final output will look like this, with the required output in column location_new:
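A hedged alternative for the same problem, assuming the column is named location as above: a single regex replace that only touches spaces occurring before the first comma, i.e. inside the city name, leaving the state/zip spacing untouched.

```python
import pandas as pd

df = pd.DataFrame({'location': ['Cape May Court House, NJ 08210',
                                'Van Buron Charter Township, MI 48111']})

# the lookahead (?=[^,]*,) only matches spaces that still have a comma
# ahead of them, so spaces after the comma are left alone
df['location_new'] = df['location'].str.replace(r'\s+(?=[^,]*,)', '_', regex=True)
print(df['location_new'].tolist())
```

The regex keyword of Series.str.replace requires pandas ≥ 1.0; on older versions the pattern is treated as a regex by default.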

Merging rows in a dataframe by ID

I have a large excel document of people who have had vaccinations.
I am trying to use python and pandas to work with this data to help work out who still needs further vaccinations and who does not.
I have happily imported the document into pandas as a dataframe
Each person has a unique ID in the dataframe
However, each vaccination has a separate row (rather than each person)
i.e. people who have had more than a single dose of vaccine have multiple rows in the document
I want to join all of the vaccinations together so that each person has a single row and all the vaccinations they have had are listed in their row.
ID
NAME
VACCINE
VACCINE DATE
0
JP
AZ
12/01/2021
1
PL
PF
13/01/2021
0
JP
MO
24/01/2021
1
PL
MO
24/01/2021
2
LK
AZ
12/01/2021
3
MN
AZ
12/01/2021
Should become:
ID
NAME
VACCINE
VACCINE DATE
VACCINE2
VACCINE2 DATE
0
JP
AZ
12/01/2021
MO
24/01/2021
1
PL
PF
13/01/2021
MO
24/01/2021
2
LK
AZ
12/01/2021
3
MN
AZ
12/01/2021
So I want to store all vaccine information for each individual in a single entry.
I have tried to use groupby to do this but it seems to entirely delete the ID field??
Am I using completely the wrong tool?
I don't want to resort to using a for loop to just iterate though every entry as this feels like very much the wrong way to accomplish the task.
old_df = pd.read_excel(filename, sheet_name="Report Data")
new_df = old_df.groupby(["PATIENT ID"]).ffill()
I am trying my best to use this as a way of teaching myself to use pandas but struggling to get anywhere so please forgive my novice level.
EDIT:
I have found this code:
s = raw_file.groupby('ID')['Vaccine Date'].apply(list)
new_file = pd.DataFrame(s.values.tolist(), index=s.index).add_prefix('Vaccine Date ').reset_index()
I modified this from what seemed to be a similar problem I found:
Python, Merging rows with same value in one column
Which seems to be doing part of what I want. It creates new columns for each vaccine date with a slightly adjusted column label. However, I cannot see a way to do this for both Vaccine date AND Vaccine brand at the same time and without losing all other data in the table.
I suppose I could just do it twice and then merge the outputs with the original dataframe to make a new complete dataframe but thought there might be a more elegant solution.
This is what I did for the same problem.
1 - Separate the first vaccination row of every name from the dataframe and save it as a new dataframe.
2 - Create a third dataframe which is the initial dataframe minus the new dataframe from (1).
3 - Left join both dataframes and drop NA.
df = pd.read_csv('<your_dataset_name>.csv')
first_dataframe = df.drop_duplicates(subset='ID', keep='first')
data = df[~df.isin(first_dataframe)].dropna()
final = first_dataframe.merge(data, how='left', on=['ID', 'NAME'])
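A sketch of another approach that handles VACCINE and VACCINE DATE in one pass, assuming the column names from the question: number each person's doses with groupby().cumcount(), then unstack so each dose becomes its own pair of columns.

```python
import pandas as pd

df = pd.DataFrame({
    'ID':   [0, 1, 0, 1, 2, 3],
    'NAME': ['JP', 'PL', 'JP', 'PL', 'LK', 'MN'],
    'VACCINE': ['AZ', 'PF', 'MO', 'MO', 'AZ', 'AZ'],
    'VACCINE DATE': ['12/01/2021', '13/01/2021', '24/01/2021',
                     '24/01/2021', '12/01/2021', '12/01/2021'],
})

# number each person's doses 1, 2, ... in row order
df['dose'] = df.groupby('ID').cumcount() + 1

# one row per person; each (column, dose) pair becomes its own column
wide = df.set_index(['ID', 'NAME', 'dose']).unstack('dose')
wide.columns = [f'{col} {n}' for col, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

People with fewer doses simply get NaN in the higher-numbered columns, and no data outside the two vaccine columns is lost.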

python pandas, create a new column from an existing column, take the x number of characters from a string

i have a dataframe:
Project
The Bike Shop - London
Car Dealer - New York
Airport - Berlin
I want to add 2 new columns to the dataframe : business & location.
i can find where the "-" is in the string by using:
df['separator'] = df['Project'].str.find('-')
What's the best and cleanest way to get the 2 new fields into the dataframe?
ie, ProjectType & Location
Project                  ProjectType     Location
The Bike Shop - London   The Bike Shop   London
Car Dealer - New York    Car Dealer      New York
Airport - Berlin         Airport         Berlin
thanks in advance :)
If I'm understanding correctly, your current dataframe looks something like this:
and you want it to look like this:
If that's what you're looking for, you can use a list comprehension:
df['ProjectType'] = [project.split(' - ')[0] for project in df['Project']]
df['Location'] = [project.split(' - ')[1] for project in df['Project']]
del df['Project'] # If you want to remove the original column
If your data is separated by ' - ', you can split it into several columns at once:
new_df = df['Project'].str.split(' - ', expand=True)
Here is a good description of how to split a column into several others:
http://datalytics.ru/all/kak-v-pandas-razbit-kolonku-na-neskolko-kolonok/
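For this particular dataframe, the split can also be written in one assignment, assuming the column is named Project as in the question; n=1 limits the split to the first ' - ' in case a name contains the separator again.

```python
import pandas as pd

df = pd.DataFrame({'Project': ['The Bike Shop - London',
                               'Car Dealer - New York',
                               'Airport - Berlin']})

# n=1 splits on the first " - " only, so at most two columns come back
df[['ProjectType', 'Location']] = df['Project'].str.split(' - ', n=1, expand=True)
print(df)
```

This keeps the original Project column; drop it afterwards if it is no longer needed.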

Pandas - Find String and return adjacent values for the matched data

I'm struggling to write a piece of code to achieve/overcome the below problem.
I have two excel spreadsheets. Let's take as an example
DF1 - 1. Master Data
DF2 - 2. Consumer details.
I need to iterate over the description column in Consumer details, which contains a string or substring that is in the Master data sheet, and return an adjacent value. I understand it's pretty straightforward and simple, but I have been unable to succeed.
I was using Index Match in Excel -
INDEX('Path\[Master Sheet.xlsx]Master
List'!$B$2:$B$199,MATCH(TRUE,ISNUMBER(SEARCH('path\[Master
Sheet.xlsx]Master List'!$A$2:$A$199,B3)),0))
But need a solution in Python/Pandas -
Eg. Df1 - Master Sheet -
Store Category
Nike Shoes
GAP Clothing
Addidas Shoes
Apple Electronics
Abercrombie Clothing
Hollister Clothing
Samsung Electronics
Netflix Movies
etc.....
df2 - Consumer Sheet-
Date Description Amount Category
01/01/20 GAP Stores 1.1
01/01/20 Apple Limited 1000
01/01/20 Aber fajdfal 50
01/01/20 hollister das 20
01/01/20 NETFLIX.COM 10
01/01/20 GAP Kids 5.6
Now, I need to update the Category column in the consumer sheet based on the description (string/substring) column in the consumer sheet, referring to the Store column in the master sheet.
Any inputs/suggestion, highly appreciated.
One option is to make a custom function that loops through the df1 values in order to match a store to a string provided as an argument. If a match is found it will return the associated category string, and if none is found it returns None or some other default value. You can use str.lower to increase the chances of a match being found. You then use pandas.Series.apply to apply this function to the column you want to try and find matches in.
import pandas as pd

df1 = pd.DataFrame(dict(
    Store=['Nike', 'GAP', 'Addidas', 'Apple', 'Abercrombie'],
    Category=['Shoes', 'Clothing', 'Shoes', 'Electronics', 'Clothing'],
))
df2 = pd.DataFrame(dict(
    Date=['01/01/20', '01/01/20', '01/01/20'],
    Description=['GAP Stores', 'Apple Limited', 'Aber fajdfal'],
    Amount=[1.1, 1000, 50],
))

def get_cat(x):
    # return the category of the first store whose name appears in x
    for store, cat in df1[['Store', 'Category']].values:
        if store.lower() in x.lower():
            return cat

df2['Category'] = df2['Description'].apply(get_cat)
print(df2)
Output:
Date Description Amount Category
0 01/01/20 GAP Stores 1.1 Clothing
1 01/01/20 Apple Limited 1000.0 Electronics
2 01/01/20 Aber fajdfal 50.0 None
I should note that if 'Aber fajdfal' is supposed to match to 'Abercrombie' then this solution is not going to work. You'll need to add more complex logic to the function in order to match partial strings like that.
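A more vectorized sketch of the same idea, assuming the column names above: build one alternation pattern from the store names, extract the first case-insensitive match from each description, and map it to its category. Like the apply version, it only matches whole store names, so 'Aber fajdfal' still comes back empty.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'Store': ['Nike', 'GAP', 'Addidas', 'Apple', 'Abercrombie'],
                    'Category': ['Shoes', 'Clothing', 'Shoes', 'Electronics', 'Clothing']})
df2 = pd.DataFrame({'Description': ['GAP Stores', 'Apple Limited', 'Aber fajdfal']})

# one alternation pattern over all store names, matched case-insensitively
pattern = '(' + '|'.join(re.escape(s) for s in df1['Store']) + ')'
mapping = {s.lower(): c for s, c in zip(df1['Store'], df1['Category'])}

df2['Category'] = (df2['Description']
                   .str.extract(pattern, flags=re.IGNORECASE)[0]
                   .str.lower()
                   .map(mapping))
print(df2)
```

Lower-casing the extracted text before the map keeps the lookup working even when a description spells the store name differently.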

Extract part from an address in pandas dataframe column

I work through a pandas tutorial that deals with analyzing sales data (https://www.youtube.com/watch?v=eMOA1pPVUc4&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp&index=6). The data is already in a dataframe format, within the dataframe is one column called "Purchase Address" that contains street, city and state/zip code. The format looks like this:
Purchase Address
917 1st St, Dallas, TX 75001
682 Chestnut St, Boston, MA 02215
...
My idea was to convert the data to a string and to then drop the irrelevant list values. I used the command:
all_data['Splitted Address'] = all_data['Purchase Address'].str.split(',')
That worked for converting the data to a comma separated list of the form
[917 1st St, Dallas, TX 75001]
Now, the whole column 'Splitted Address' looks like this and I am stuck at this point. I simply wanted to drop the list indices 0 and 2 and to keep 1, i.e. the city in another column.
In the tutorial the solution was laid out using the .apply() method:
all_data['Column'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
This solution definitely looks more elegant than mine so far, but I wondered whether I can reach a solution with my approach with a comparable amount of effort.
Thanks in advance.
Use Series.str.split with selecting by indexing:
all_data['Column'] = all_data['Purchase Address'].str.split(',').str[1]
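Building on that one-liner, the same split can fill several columns at once, assuming the sample addresses above; the column names given to the result are of course arbitrary.

```python
import pandas as pd

all_data = pd.DataFrame({'Purchase Address': ['917 1st St, Dallas, TX 75001',
                                              '682 Chestnut St, Boston, MA 02215']})

# expand=True turns the comma-separated list into one column per part
parts = all_data['Purchase Address'].str.split(', ', expand=True)
parts.columns = ['Street', 'City', 'State ZIP']
print(parts)
```

From here you can keep just parts['City'], or join the whole frame back onto all_data.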
