How to pivot a table based on the values of one column - python

let's say I have the below dataframe:
dataframe = pd.DataFrame({'col1': ['Name', 'Location', 'Phone','Name', 'Location'],
'Values': ['Mark', 'New York', '656','John', 'Boston']})
which looks like this:
col1 Values
Name Mark
Location New York
Phone 656
Name John
Location Boston
As you can see I have my wanted columns as rows in col1 and not all values have a Phone number, is there a way for me to transform this dataframe to look like this:
Name Location Phone
Mark New York 656
John Boston NaN
I have tried to transpose in Excel, do a Pivot and a Pivot_Table:
pivoted = pd.pivot_table(data = dataframe, values='Values', columns='col1')
But this comes out incorrectly. any help would be appreciated on this.
NOTES: All new section start with the Name value and end before the Name value of the next person.

Create a new index using cumsum to identify unique sections then do pivot as usual...
df['index'] = df['col1'].eq('Name').cumsum()
df.pivot('index', 'col1', 'Values')
col1 Location Name Phone
index
1 New York Mark 656
2 Boston John NaN

Related

Split a column in Python pandas

I'm sorry if I can't explain properly the issue I'm facing since I don't really understand it that much. I'm starting to learn Python and to practice I try to do projects that I face in my day to day job, but using Python. Right now I'm stuck with a project and would like some help or guidance, I have a dataframe that looks like this
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize since I can't create a table on this post since the separator of the ids is a | ) but you get the idea, every person has 4 IDs and they are all on the same "cell" of the dataframe, each ID separated from its value by pipes, I need to split those ID's from their values, and put them on separate columns so I get something like this
index
Country
Name
PERSID
SSO
STARTDATE
WAVE
0
USA
John
12345
John123
20210101
WAVE39
1
UK
Jane
25478
Jane123
20210101
WAVE40
Now, adding to the complexity of the table itself, I have another issues, for example, the order of the ID's won't be the same for everyone and some of them will be missing some of the ID's.
I honestly have no idea where to begin, the first thing I thought about trying was to split the IDs column by spaces and then split the result of that by pipes, to create a dictionary, convert it to a dataframe and then join it to my original dataframe using the index.
But as I said, my knowledge in python is quite pathetic, so that failed catastrophically, I only got to the first step of that plan with a Client_ids = df.IDs.str.split(), that returns a series with the IDs separated one from each other like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'] but I can't find a way to split it again because I keep getting an error saying the the list object doesn't have attribute 'split'
How should I approach this? what alternatives do I have to do it?
Thank you in advance for any help or recommendation
You have a few options to consider to do this. Here's how I would do it.
I will split the values in IDs by \n and |. Then create a dictionary with key:value for each split of values of |. Then join it back to the dataframe and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
df['temp'] = df['IDs'].str.split('\n|\|').apply(lambda x: {k:v for k,v in zip(x[::2],x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs','temp'],axis=1)
print (df)
With this approach, it does not matter if a row of data is missing. It will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
First generate your dataframe
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using lambda
df2 = pd.DataFrame(list(df.apply(lambda r: {p:q for p,q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly concat the dataframes together.
df = pd.concat([df1, df2], axis=1)
Quick solution
remove_word = ["PERSID", "SSO" ,"STARTDATE" ,"WAVE"]
for i ,col in enumerate(remove_word):
df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]
Use regex named capture groups with pd.String.str.extract
def ng(x):
return f'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to #JoeFerndz for sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])

Removing the rows from dataframe till the actual column names are found

I am reading tabular data from the email in the pandas dataframe.
There is no guarantee that column names will contain in the first row.Sometimes data is in the following format.The actual column names are [ID,Name and Year]
dummy1 dummy2 dummy3
test_column1 test_column2 test_column3
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email,how I remove the initial rows that don't contain the column names?So in the first case I would need to remove first 2 rows in the dataframe(including column row) and in the second case,i wouldn't have to remove anything.
Also,the column names can be in any sequence.
basically,I want to do in following
1.check whether once of the column names contains in one of the rows in dataframe
2.Remove the rows above
if "ID" in row:
remove the above rows
How can I achieve this?
You can first get index of valid columns and then filter and set accordingly.
df = pd.read_csv("d.csv",sep='\s+', header=None)
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :] # filter data
df
ID Name Year
3 1 John Sophomore
4 2 Lisa Junior
5 3 Ed Senior
or
If you want to se ID as index
df = df.iloc[col_index + 1 :].set_index('ID')
df
Name Year
ID
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Ugly but effective quick try:
id_name = df.columns[0]
df_clean = df[(df[id_name] == 'ID') | (df[id_name].dtype == 'int64')]

Pandas duplicated rows with missing values

Hello I have a dataframe that contains duplicates.
df = pd.DataFrame({'id':[1,1,1],
'name':['Hamburg','Hamburg','Hamburg'],
'country':['Germany','Germany',None],
'state':[None,None,'Hamburg']})
removing the duplicates with df.drop_duplicates() returns:
How can I configure drop_duplicates such that only one row is left, that contains all the information?
In the case you have no row with all the information at once, you can use groupby and first but first fillna None with np.nan to work with missing values:
print (df.fillna(value=np.nan).groupby('id').first())
name country state
id
1 Hamburg Germany Hamburg
In your very special case, here's my proposal :
import pandas
df = pandas.DataFrame({'id':[1,1,1,2,2],
'name':['Hamburg','Hamburg','Hamburg','Paris','Paris'],
'country':['Germany','Germany',None, None, 'France'],
'state':[None,None,'Hamburg', 'Paris', None]})
df_result=pandas.DataFrame()
for id in df['id'].unique().tolist() :
df_subset=df[df['id']==id].copy(deep=True)
df_subset.sort_values(by=['id','name','country','state'],inplace=True)
df_subset.bfill(inplace=True)
df_subset.ffill(inplace=True)
df_subset.drop_duplicates(inplace=True)
df_result=df_result.append(df_subset)
df=df_result
Out[18]:
id name country state
0 1 Hamburg Germany Hamburg
4 2 Paris France Paris
Subsetting the records will avoid ffill or bfill to fill adjacent but different id records.
Regards

Update missing values in a column using pandas

I have a dataframe df with two of the columns being 'city' and 'zip_code':
df = pd.DataFrame({'city': ['Cambridge','Washington','Miami','Cambridge','Miami',
'Washington'], 'zip_code': ['12345','67891','23457','','','']})
As shown above, a particular city contains zip code in one of the rows, but the zip_code is missing for the same city in some other row. I want to fill those missing values based on the zip_code values of that city in some other row. Basically, wherever there is a missing zip_code, it checks zip_code for that city in other rows, and if found, fills the value for zip_code.If not found, fills 'NA'.
How do I accomplish this task using pandas?
You can go for:
import numpy as np
df['zip_code'] = df.replace(r'', np.nan).groupby('city')['zip_code'].fillna(method='ffill').fillna(method='bfill')
>>> df
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
You can check the string length using str.len and for those rows, filter the main df to those with valid zip_codes, set the index to those and call map on the 'city' column which will perform the lookup and fill those values:
In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df
Out[255]:
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
If your real data has lots of repeating values then you'll need to additionally call drop_duplicates first:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])
The reason you need to do this is because it'll raise an error if there are duplicate index entries
My suggestion would be to first create a dictonary that maps from the city to the zip code. You can create this dictionary from the one DataFrame.
And then you use that dictionary to fill in all missing zip code values.

How do I use a mapping variable to re-index a dataframe?

I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a 2 column dataframe(happy for this to be made into another datastruct if that will be more beneficial as the plan is that it will be sorted in a VARS file.
county code
Spain es
France fr
United Kingdom uk
The 'mapping' datastruct will be sorted in a random order as countries will be added/removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns so for example if a data frame was indexed on date but one column was df['county'] then you could change df['country'] to its country code? Finally is there a third option that would add an additional column that was either country/code which selected the right code based on a country name in another column?
I think you can use Series.map, but it works only with Series, so need Index.to_series. Last rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
It is same as mapping by dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.todict("split") to create a dictionary, then use a list comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on a old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in separate dataframe but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
#creating the mapping dictionary in the form of current index : future index
df2 = pd.DataFrame([["es"],["fr"]],index = ["spain","france"])
interm_dict = df2.to_dict("split") #Creates a dictionary split into column labels, data labels and data
mapping_dict = {country:data[0] for country,data in zip(interm_dict["index"],interm_dict['data'])}
#We only want the first column of the data and the index so we need to make a new dict with a list comprehension and zip
df["country"] = df.index #Create a new column if u want to save the index
df.index = pd.Series(df.index).map(mapping_dict) #change the index
df.index.name = "" #Blanks out index name
df = df.drop("county code",1) #Drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "french"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code",1) #if you dont want the duplicate column
Before:
county code language
spain es spanish
french fr french
After:
language country
es spanish spain
fr french french
3) Creating an additional column based on another column
This is again essentially the same as step 2 except we create an additional column instead of assigning .index to the created series.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris

Categories

Resources