I have a dataframe which looks like:
df =
|Name Nationality Family etc.....
0|John Born in Spain. Wife
1|nan But live in England son
2|nan nan daughter
Some columns only have one row but others have multiple answers over a few rows, how could i merge the rows onto each other so it would look something like the below:
df =
|Name Nationality Family etc....
0|John Born in Spain. But live in England Wife Son Daughter
Perhaps this will do it for you:
import pandas as pd
# your dataframe
df = pd.DataFrame(
{'Name': ['John', np.nan, np.nan],
'Nationality': ['Born in Spain.', 'But live in England', np.nan],
'Family': ['Wife', 'son', 'daughter']})
def squeeze_df(df):
new_df = {}
for col in df.columns:
new_df[col] = [df[col].str.cat(sep=' ')]
return pd.DataFrame(new_df)
squeeze_df(df)
# >> out:
# Name Nationality Family
# 0 John Born in Spain. But live in England Wife son daughter
I made the assumption that you only need to do this for one single person (i.e. squeezing/joining the rows of the dataframe into a single row). Also, what does "etc...." mean? For example, will you have integer or floating point values in the dataframe?
Related
I have a dataframe (df1) that looks like this:
first_name
last_name
affiliation
jean
dulac
University of Texas
peter
huta
University of Maryland
I want to match this dataframe to another one that contains several potential matches. Each potential match has a first and last name and also a list of all the affiliations this person was associated with, and I want to use the information in this affiliation column to differentiate between my potential matches and keep only the most likely one.
The second dataframe has the following form:
first_name
last_name
affiliations_all
jean
dulac
[{'city_name': 'Kyoto', 'country_name': 'Japan', 'name': 'Kyoto University'}]
jean
dulac
[{'city_name': 'Texas', 'country_name': 'USA', 'name': 'University of Texas'}]
The column affiliations_all is apparently saved as a pandas.core.series.Series (and I can't change that since it comes from an API query).
I am thinking that one way to match the 2 dataframes would be to remove the words like "university" and "of" from the affiliation column of the first dataframe (that's easy), do the same for the affiliations_all column of the second dataframe (don't know how to do that) and then run some version of
test.apply(lambda x: str(x.affiliation) in str(x.affiliations_all), axis=1)
adapted to the fact that affiliations_all is a series.Series.
Any idea how to do that?
Thanks!
One possible solution would be to transform df2 (expand the columns) and then merge df1 with df2:
# transform df2
df2 = df2.explode("affiliations_all")
df2 = pd.concat([df2, df2.pop("affiliations_all").apply(pd.Series)], axis=1)
df2 = df2.rename(columns={"name": "affiliation"})
print(df2)
This prints:
first_name last_name city_name country_name affiliation
0 jean dulac Kyoto Japan Kyoto University
1 jean dulac Texas USA University of Texas
And the seconds step will be merge df1 with transformed df2:
df_out = pd.merge(df1, df2, on=["first_name", "last_name", "affiliation"])
print(df_out)
Prints:
first_name last_name affiliation city_name country_name
0 jean dulac University of Texas Texas USA
let's say I have the below dataframe:
dataframe = pd.DataFrame({'col1': ['Name', 'Location', 'Phone','Name', 'Location'],
'Values': ['Mark', 'New York', '656','John', 'Boston']})
which looks like this:
col1 Values
Name Mark
Location New York
Phone 656
Name John
Location Boston
As you can see I have my wanted columns as rows in col1 and not all values have a Phone number, is there a way for me to transform this dataframe to look like this:
Name Location Phone
Mark New York 656
John Boston NaN
I have tried to transpose in Excel, do a Pivot and a Pivot_Table:
pivoted = pd.pivot_table(data = dataframe, values='Values', columns='col1')
But this comes out incorrectly. any help would be appreciated on this.
NOTES: All new section start with the Name value and end before the Name value of the next person.
Create a new index using cumsum to identify unique sections then do pivot as usual...
df['index'] = df['col1'].eq('Name').cumsum()
df.pivot('index', 'col1', 'Values')
col1 Location Name Phone
index
1 New York Mark 656
2 Boston John NaN
I'm sorry if I can't explain properly the issue I'm facing since I don't really understand it that much. I'm starting to learn Python and to practice I try to do projects that I face in my day to day job, but using Python. Right now I'm stuck with a project and would like some help or guidance, I have a dataframe that looks like this
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize since I can't create a table on this post since the separator of the ids is a | ) but you get the idea, every person has 4 IDs and they are all on the same "cell" of the dataframe, each ID separated from its value by pipes, I need to split those ID's from their values, and put them on separate columns so I get something like this
index
Country
Name
PERSID
SSO
STARTDATE
WAVE
0
USA
John
12345
John123
20210101
WAVE39
1
UK
Jane
25478
Jane123
20210101
WAVE40
Now, adding to the complexity of the table itself, I have another issues, for example, the order of the ID's won't be the same for everyone and some of them will be missing some of the ID's.
I honestly have no idea where to begin, the first thing I thought about trying was to split the IDs column by spaces and then split the result of that by pipes, to create a dictionary, convert it to a dataframe and then join it to my original dataframe using the index.
But as I said, my knowledge in python is quite pathetic, so that failed catastrophically, I only got to the first step of that plan with a Client_ids = df.IDs.str.split(), that returns a series with the IDs separated one from each other like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'] but I can't find a way to split it again because I keep getting an error saying the the list object doesn't have attribute 'split'
How should I approach this? what alternatives do I have to do it?
Thank you in advance for any help or recommendation
You have a few options to consider to do this. Here's how I would do it.
I will split the values in IDs by \n and |. Then create a dictionary with key:value for each split of values of |. Then join it back to the dataframe and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
df['temp'] = df['IDs'].str.split('\n|\|').apply(lambda x: {k:v for k,v in zip(x[::2],x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs','temp'],axis=1)
print (df)
With this approach, it does not matter if a row of data is missing. It will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
First generate your dataframe
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using lambda
df2 = pd.DataFrame(list(df.apply(lambda r: {p:q for p,q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly concat the dataframes together.
df = pd.concat([df1, df2], axis=1)
Quick solution
remove_word = ["PERSID", "SSO" ,"STARTDATE" ,"WAVE"]
for i ,col in enumerate(remove_word):
df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]
Use regex named capture groups with pd.String.str.extract
def ng(x):
return f'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to #JoeFerndz for sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
I'm trying to test if one column (surname) is a substring of another column (name) in a dataframe (el). I've tried the following but python doesn't like it
el.name.str.contains(el.surname)
I can see plenty of examples of how to search for a literal substring, but not where the substring is a column.
Going mad on this one, help please!
Dave
You may use
import pandas as pd
dct = {'surname': ['Smith', 'Miller', 'Mayer'],
'name': ['Dr. John Smith', 'Nobody', 'Prof. Dr. Mayer']}
df = pd.DataFrame(dct)
df['is_part_of_name'] = df.apply(lambda x: x["surname"] in x["name"], axis=1)
print(df)
Which yields
surname name is_part_of_name
0 Smith Dr. John Smith True
1 Miller Nobody False
2 Mayer Prof. Dr. Mayer True
I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a 2 column dataframe(happy for this to be made into another datastruct if that will be more beneficial as the plan is that it will be sorted in a VARS file.
county code
Spain es
France fr
United Kingdom uk
The 'mapping' datastruct will be sorted in a random order as countries will be added/removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns so for example if a data frame was indexed on date but one column was df['county'] then you could change df['country'] to its country code? Finally is there a third option that would add an additional column that was either country/code which selected the right code based on a country name in another column?
I think you can use Series.map, but it works only with Series, so need Index.to_series. Last rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
It is same as mapping by dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.todict("split") to create a dictionary, then use a list comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on a old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in separate dataframe but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
#creating the mapping dictionary in the form of current index : future index
df2 = pd.DataFrame([["es"],["fr"]],index = ["spain","france"])
interm_dict = df2.to_dict("split") #Creates a dictionary split into column labels, data labels and data
mapping_dict = {country:data[0] for country,data in zip(interm_dict["index"],interm_dict['data'])}
#We only want the first column of the data and the index so we need to make a new dict with a list comprehension and zip
df["country"] = df.index #Create a new column if u want to save the index
df.index = pd.Series(df.index).map(mapping_dict) #change the index
df.index.name = "" #Blanks out index name
df = df.drop("county code",1) #Drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "french"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code",1) #if you dont want the duplicate column
Before:
county code language
spain es spanish
french fr french
After:
language country
es spanish spain
fr french french
3) Creating an additional column based on another column
This is again essentially the same as step 2 except we create an additional column instead of assigning .index to the created series.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris