Merging rows in a dataframe by ID - python

I have a large Excel document of people who have had vaccinations.
I am trying to use Python and pandas to work with this data to help work out who still needs further vaccinations and who does not.
I have happily imported the document into pandas as a dataframe.
Each person has a unique ID in the dataframe.
However, each vaccination has a separate row (rather than each person), i.e. people who have had more than a single dose of vaccine have multiple rows in the document.
I want to join all of the vaccinations together so that each person has a single row and all the vaccinations they have had are listed in that row.
ID  NAME  VACCINE  VACCINE DATE
0   JP    AZ       12/01/2021
1   PL    PF       13/01/2021
0   JP    MO       24/01/2021
1   PL    MO       24/01/2021
2   LK    AZ       12/01/2021
3   MN    AZ       12/01/2021
Should become:
ID  NAME  VACCINE  VACCINE DATE  VACCINE2  VACCINE2 DATE
0   JP    AZ       12/01/2021    MO        24/01/2021
1   PL    PF       13/01/2021    MO        24/01/2021
2   LK    AZ       12/01/2021
3   MN    AZ       12/01/2021
So I want to store all vaccine information for each individual in a single entry.
I have tried to use groupby to do this, but it seems to entirely delete the ID field.
Am I using completely the wrong tool?
I don't want to resort to using a for loop to iterate through every entry, as that feels like very much the wrong way to accomplish the task.
old_df = pd.read_excel(filename, sheet_name="Report Data")
new_df = old_df.groupby(["PATIENT ID"]).ffill()
I am trying my best to use this as a way of teaching myself to use pandas but struggling to get anywhere so please forgive my novice level.
EDIT:
I have found this code:
s = raw_file.groupby('ID')['Vaccine Date'].apply(list)
new_file = pd.DataFrame(s.values.tolist(), index=s.index).add_prefix('Vaccine Date ').reset_index()
I modified this from what seemed to be a similar problem I found:
Python, Merging rows with same value in one column
Which seems to be doing part of what I want. It creates new columns for each vaccine date with a slightly adjusted column label. However, I cannot see a way to do this for both Vaccine date AND Vaccine brand at the same time and without losing all other data in the table.
I suppose I could just do it twice and then merge the outputs with the original dataframe to make a new complete dataframe but thought there might be a more elegant solution.
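For example, would something like this work? (Just a sketch: I am assuming the column names from my example table above, and I believe pivot accepts a list of index columns from pandas 1.1 onwards.)
import pandas as pd

df = pd.read_excel(filename, sheet_name="Report Data")

# number each person's doses in date order: 1, 2, ...
df["VACCINE DATE"] = pd.to_datetime(df["VACCINE DATE"], dayfirst=True)
df = df.sort_values("VACCINE DATE")
df["DOSE"] = df.groupby("ID").cumcount() + 1

# pivot VACCINE and VACCINE DATE to wide form in one step
wide = df.pivot(index=["ID", "NAME"], columns="DOSE",
                values=["VACCINE", "VACCINE DATE"])

# flatten the MultiIndex columns to "VACCINE 1", "VACCINE DATE 1", ...
wide.columns = [f"{col} {dose}" for col, dose in wide.columns]
wide = wide.reset_index()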

This is what I did for the same problem.
1. Separated the first vaccination row of every name from the dataframe and saved it as a new dataframe.
2. Created a third dataframe, which is the initial dataframe minus the new dataframe from (1).
3. Left joined both dataframes and dropped NA.
df = pd.read_csv('<your_dataset_name>.csv')
first_dataframe = df.drop_duplicates(subset='ID', keep='first')
data = df[~df.isin(first_dataframe)].dropna()
final = first_dataframe.merge(data, how='left', on=['ID', 'NAME'])
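Note that both frames still share the VACCINE and VACCINE DATE columns, so the merge will add pandas' default _x/_y suffixes to the overlapping names, giving one pair of columns per dose; anyone with a third dose would still contribute an extra row in data, so this fits the two-dose case shown above.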

Related

Excel, How to split cells by comma delimiter into new cells

So let's say I have data like this with some delimiter like commas that I want to split to new cells either across to columns or down into rows.
The Data                      Location
One Museum, Two Museum        City A
3rd Park, 4th Park, 5th Park  City B
How would you do it in either direction? There are lots of methods; why is the method you provide preferred?
Looking for methods in:
Python
Excel
Power Query
R
The Excel manual method: click on Data > Text to Columns. Then just copy and paste if you want the data in one column. This is only good when the data set is small and you are doing it once.
The Power Query method: with this method you do it once for the data source, then click the refresh button when the data changes in the future. The data source can be almost anything, like a csv file, a website, etc. Steps below:
1 - Pick your data source
2 - When within Excel, choose From Table/Range
3 - Now choose the split method; there is delimiter and there are 6 other choices
4 - For this data I went with custom and used ", "
5 & 6 - To split down into rows you have to select Advanced options and make the selection
7 - Close & Load
This is a good method because you don't have to code in Power Query unless you want to.
The Python method
Make sure you have installed pandas with pip or conda.
The code is like so:
import pandas as pd
df = pd.read_excel('path/to/myexcelfile.xlsx')
df[['key.0','key.1','key.2']] = df['The Data'].str.split(',', expand=True)
df.drop(columns=['The Data'], inplace = True)
# stop here if you want the data to be split into new columns
The data looks like this
  Location       key.0       key.1     key.2
0   City A  One Museum  Two Museum      None
1   City B    3rd park    4th park  5th park
To get the split into rows proceed with the next code part:
stacked = df.set_index('Location').stack()
# set the name of the new series created
df = stacked.reset_index(name='The Data')
# drop the 'source' level (key.*)
df.drop('level_1', axis=1, inplace=True)
Now this is done and it looks like this
  Location    The Data
0   City A  One Museum
1   City A  Two Museum
2   City B    3rd park
3   City B    4th park
4   City B    5th park
The benefit of Python is that it is faster for larger data sets and you can split using regex in probably a hundred ways. The data source can be all the types you would use for Power Query and more.
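For instance, a regex split that also swallows the space after each comma might look like this (a sketch assuming the original unsplit df from above, and pandas 1.4+ for the explicit regex flag):
# split on a comma followed by any amount of whitespace
df[['key.0','key.1','key.2']] = df['The Data'].str.split(r',\s*', regex=True, expand=True)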
R
library(data.table)
dt <- fread("yourfile.csv") # or use readxl package for xls files
dt
# Data Location
# 1: One Museum, Two Museum City A
# 2: 3rd Park, 4th Park, 5th Park City B
dt[, .(Data = unlist(strsplit(Data, ", "))), by = Location]
# Location Data
# 1: City A One Museum
# 2: City A Two Museum
# 3: City B 3rd Park
# 4: City B 4th Park
# 5: City B 5th Park

Pandas - Find String and return adjacent values for the matched data

I'm struggling to write a piece of code to achieve/overcome the below problem.
I have two excel spreadsheets. Let's take as an example:
DF1 - 1. Master Data
DF2 - 2. Consumer details.
I need to iterate over the Description column in Consumer details, which contains a string or substring that appears in the Master data sheet, and return an adjacent value. I understand it's pretty straightforward and simple, but I am unable to succeed.
I was using Index Match in Excel -
INDEX('Path\[Master Sheet.xlsx]Master List'!$B$2:$B$199, MATCH(TRUE, ISNUMBER(SEARCH('path\[Master Sheet.xlsx]Master List'!$A$2:$A$199, B3)), 0))
But need a solution in Python/Pandas -
Eg Df1 - Master Sheet -
Store        Category
Nike         Shoes
GAP          Clothing
Addidas      Shoes
Apple        Electronics
Abercrombie  Clothing
Hollister    Clothing
Samsung      Electronics
Netflix      Movies
etc...
df2 - Consumer Sheet-
Date      Description    Amount  Category
01/01/20  GAP Stores     1.1
01/01/20  Apple Limited  1000
01/01/20  Aber fajdfal   50
01/01/20  hollister das  20
01/01/20  NETFLIX.COM    10
01/01/20  GAP Kids       5.6
Now, I need to update the Category column in the consumer sheet based on the Description (string/substring) column in the consumer sheet, referring to the Store column in the master sheet.
Any inputs/suggestions highly appreciated.
One option is to make a custom function that loops through the df1 values in order to match a store to a string provided as an argument. If a match is found, it returns the associated category string; if none is found, it returns None or some other default value. You can use str.lower to increase the chances of a match being found. You then use pandas.Series.apply to apply this function to the column you want to find matches in.
import pandas as pd

df1 = pd.DataFrame(dict(
    Store = ['Nike','GAP','Addidas','Apple','Abercrombie'],
    Category = ['Shoes','Clothing','Shoes','Electronics','Clothing'],
))
df2 = pd.DataFrame(dict(
    Date = ['01/01/20','01/01/20','01/01/20'],
    Description = ['GAP Stores','Apple Limited','Aber fajdfal'],
    Amount = [1.1,1000,50],
))

def get_cat(x):
    # return the category of the first store name contained in the description
    # (df1 is read from the enclosing scope)
    for store, cat in df1[['Store','Category']].values:
        if store.lower() in x.lower():
            return cat

df2['Category'] = df2['Description'].apply(get_cat)
print(df2)
Output:
       Date    Description  Amount     Category
0  01/01/20     GAP Stores     1.1     Chlothing
1  01/01/20  Apple Limited  1000.0  Electronics
2  01/01/20   Aber fajdfal    50.0         None
I should note that if 'Aber fajdfal' is supposed to match to 'Abercrombie' then this solution is not going to work. You'll need to add more complex logic to the function in order to match partial strings like that.
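For example, a rough fuzzy variant using difflib from the standard library could look like this (a sketch only; the 0.5 cutoff is an arbitrary illustration, not a tuned value):
from difflib import SequenceMatcher

def get_cat_fuzzy(desc, threshold=0.5):
    # keep the best-scoring (store, word) pair above the cutoff
    best_cat, best_score = None, threshold
    for store, cat in df1[['Store','Category']].values:
        for word in desc.lower().split():
            score = SequenceMatcher(None, store.lower(), word).ratio()
            if score > best_score:
                best_cat, best_score = cat, score
    return best_cat

df2['Category'] = df2['Description'].apply(get_cat_fuzzy)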

Pandas - Returning coordinates and value when conditional values are met

I am using pandas and Python 2.7.13, and I have been trying to import an excel file through pandas and compare data from two separate data frames, using specified conditions to find when values from DF2 fall between two time values in DF1; if the condition is met, return a value from DF1 back onto DF2.
The data sets: DF2 is a large database of records with a DateX (MM/DD/YYYY HH:MM), while DF1 is an export of staffing hours with Start Time and End Time columns, both formatted the same way, followed by the staff name. We use a 3rd party system for our staffing and it isn't connected to our database, and for a report I am producing we need to see how specific employees are affecting performance.
Example data:
DF1
Employee: Start Time: End Time:
John Smith 1/1/2017 06:30 1/1/2017 18:30
Jane Smith 1/1/2017 06:30 1/1/2017 18:30
Tommy Boy 1/2/2017 06:30 1/2/2017 15:00
DF2
DateX:
1/1/2017 12:16
1/1/2017 06:43
1/2/2017 19:32
I have some experience with Python, but this is my first time using pandas and numpy; my experience is purely project-based, from items I have attempted. My current code reads as:
import pandas as pd
file = 'sample set.xlsx'
xl = pd.ExcelFile(file)
df1 = xl.parse('Sheet1')
df2 = xl.parse('Sheet2')
for i in df2['DateX']:
    if any(i >= df1['Start Time.1']) and any(i <= df1['End Time.1']):
        print i
I am only trying to print i currently, to ensure I am pulling the right number, as I am using a limited data set as a test ground. I run into two problems. There can be multiple staff members from DF1 who worked the DateX from DF2, but this stops if there is even 1 match.
The other item is that I accepted this and tried working out how I could get it to print out the match from df1['Employee'], but my efforts only produce the entire employee column. This is a step in my learning: trying to then have it add the matching names next to the DateX in DF2.
I am still continuing to attempt and read documentation, and will update/close if I solve the problem on my own. Thank you.
My answer is similar to @Jay's but will return a list of employees for each time. Pandas unfortunately has no support for conditional joins like SQL. There is a new function, merge_asof, but it only returns a single value for each row, which would not work for you.
The following will work but be very slow.
dfs = []
for i, row in df1.iterrows():
    # rows of df2 whose DateX falls within this employee's shift
    criteria = (row['Start Time'] <= df2['DateX']) & (df2['DateX'] <= row['End Time'])
    if criteria.any():
        dfs.append(df2[criteria].assign(Employee=row['Employee']))

df2_all = pd.concat(dfs)
df2_agg = df2_all.groupby('DateX').agg(lambda x: x.tolist())
df2_final = df2_agg.reindex(df2.DateX)
                                     Employee
DateX
2017-01-01 06:43:00  [John Smith, Jane Smith]
2017-01-01 12:16:00  [John Smith, Jane Smith]
2017-01-02 19:32:00                       NaN

Create new columns for a dataframe by parsing column values and populate new columns with values from another column python

I need to add new columns to a dataframe based on lists within a certain column. The new columns need to be a set derived from all the lists in the column.
I then have another column with lists corresponding to the first but the data is slightly different. I need these values to populate the new columns if the values are not in a "do not include" list
Here is an example:
   Disease                        Status
0  Asthma|ARD                     Ph II|Ph I
1  Arthritis|Inflammation|Asthma  Ph III|Approved|No development reported
This should become:
   Disease                        Status                                    Asthma  ARD   Arthritis  Inflammation
0  Asthma|ARD                     Ph II|Ph I                                Ph II   Ph I
1  Arthritis|Inflammation|Asthma  Ph III|Approved|No development reported                 Ph III     Approved
Here the "do not include" list would just be ['No development'], though there are more terms I would like to include.
The dataframe I am working with has many columns. I am interested in developing a function to which I can simply pass the df, the column names, and a "do not include" list, and which will perform this task in an efficient way (ideally without any, or very few, loops).
My current approach has been to create a set from the Disease column, add it to the dataframe through pd.concat, and then loop through each row, split the values in the two columns, and then loop through the "Disease" list to put the correct status in each disease column.
The problem with this is that my data frame is ~12k rows, and this becomes exceptionally time intensive.
It seems that you have multiple values in each individual cell (from your previous and current questions). It would be far far easier to tidy up your data first and then continue with your analysis. Try to put each value in each column in its own cell.
df1 = pd.concat([df[col].str.split('|', expand=True).stack().reset_index(1, drop=True) for col in df.columns], axis=1)
Output of df1
              0                        1
0        Asthma                    Ph II
0           ARD                     Ph I
1     Arthritis                   Ph III
1  Inflammation                 Approved
1        Asthma  No development reported
And then you can pivot this from here and select only the columns you care about
cols = ['Asthma', 'ARD']
df2 = df1.reset_index().pivot(index='index',columns=0, values=1)[cols]
Output of df2
0                       Asthma   ARD
index
0                        Ph II  Ph I
1      No development reported  None
Then just concatenate this DataFrame to your original
pd.concat((df, df2),axis=1)
                             Disease                                    Status  \
index
0                         Asthma|ARD                                Ph II|Ph I
1      Arthritis|Inflammation|Asthma  Ph III|Approved|No development reported

                        Asthma   ARD
index
0                        Ph II  Ph I
1      No development reported  None
- make the exclusion list a set
- str.extractall was a style choice; str.split will be faster
- query to get rid of things not to include
- join
dont_include = set(['No development'])
d1 = df.stack().str.extractall('([^|]+)')[0].unstack(1) \
       .reset_index(1, drop=True).query('Status not in @dont_include') \
       .set_index('Disease', append=1).Status.unstack().fillna('')
df.join(d1)

Basic Pandas data analysis: connecting data types

I loaded in a dataframe where there is a variable called natvty, which is a frequency of numbers from 50 - 600. Each number represents a country, and each country appears more than once. I did a count of the number of times each country appears in the list. Now I would like to replace the number of the country with the name of the country, for example (57 = United States). I tried all kinds of for loops to no avail. Here's my code thus far. In the value counts table, the country number is on the left and the number of times it appears in the data is on the right. I need to replace the number on the left with the country name. The numbers which correspond to country names are in an external excel sheet in two columns. Thanks.
I think there may be no need to REPLACE the country numbers with country names at first. Since you now have two tables, one with columns ["country_number", "natvty"] and the other (your excel table, which can be exported as a .csv file and read by pandas) with columns ["country_number", "country_name"], you can simply join them both and keep all the columns. The resulting table would have 3 columns: ["country_number", "natvty", "country_name"], respectively.
import pandas as pd

df_nav = pd.read_csv("my_natvty.csv")
df_cnames = pd.read_csv("excel_country_names.csv")  # or use pd.read_excel("country_names.xlsx") directly on excel files

# join matches on the other frame's index, so index df_cnames by country_number first
df_nav_with_cnames = df_nav.join(df_cnames.set_index('country_number'), on='country_number')
Make sure they both have a column "country_number". You can modify the table head in the data source files manually, or treat them as index columns to apply join similarly. The concept is a little bit like SQL operations in relational databases.
Documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
For this sort of thing, I always prefer the map function, which eats a dictionary, or a function for that matter.
import pandas as pd
import numpy.random as npr

# generate data
df = pd.DataFrame(data={'natvty': npr.randint(low=20, high=500, size=10),
                        'country': pd.Series([1,2,3,3,3,2,1,1,2,3])})
df
   country  natvty
0        1      24
1        2     310
2        3      88
3        3     459
4        3      38
5        2      63
6        1     194
7        1     384
8        2     281
9        3     360
Then, the dict. Here I just type it, but you could load it from a csv or excel file. Then you'd want to set the key as the index and turn the resulting series into a dict (to_dict()).
countrymap = {1:'US',2:'Canada',3:'Mexico'}
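If the codes live in a csv or excel file instead, a minimal sketch (the file name and its code/name column headers are hypothetical here):
# hypothetical lookup file with columns 'code' and 'name'
countrymap = pd.read_excel('country_names.xlsx').set_index('code')['name'].to_dict()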
Then you can simply map the value labels.
df.country.map(countrymap)
0 US
1 Canada
2 Mexico
3 Mexico
4 Mexico
5 Canada
6 US
7 US
8 Canada
9 Mexico
Name: country, dtype: object
Note: The basic idea here is the same as Shellay's answer. I just wanted to demonstrate how to handle different column names in the two data frames, and how to retrieve the per-country frequencies you wanted.
You have one data frame containing country codes, and another data frame which maps country codes to country names. You simply need to join them on the country code columns. You can read more about merging in Pandas and SQL joins.
import pandas as pd
# this is your nativity frame
nt = pd.DataFrame([
    [123],
    [123],
    [456],
    [789],
    [456],
    [456]
], columns=('natvty',))
# this is your country code map
# in reality, use pd.read_excel
cc = pd.DataFrame([
    [123, 'USA'],
    [456, 'Mexico'],
    [789, 'Canada']
], columns=('country_code', 'country_name'))
# perform a join
# now each row has an associated country_name
df = nt.merge(cc, left_on='natvty', right_on='country_code')
# now you can get frequencies on country names instead of country codes
print df.country_name.value_counts(sort=False)
The output from the above is
Canada    1
USA       2
Mexico    3
Name: country_name, dtype: int64
I think a dictionary would be your best bet. If you had a dict of the countries and their codes e.g.
country_dict = {333: 'United States', 123: 'Canada', 456: 'Cuba', ...}
You presumably have a key of the countries and their codes, so you could make the dict really easily with a loop:
country_dict = {}
for i in country_list:
    country = i[0]  # if you had a list of (country, number) pairs
    number = i[1]
    country_dict[number] = country
Adding a column to your DataFrame once you have this should be straightforward:
import pandas as pd
df = pd.read_csv('my_data.csv', header=None)
df['country'] = [country_dict[df[0][i]] for i in list(df.index)]
This should work if the country codes column has index 0
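(The list comprehension works, but df['country'] = df[0].map(country_dict) does the same thing in one step, as in the map answer above.)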
