Pandas transforming chronological rows to columns - python

I have a table of work experiences where each row represents a job in chronological order from the first job to the most recent job. For data science purposes I'm trying to create a new table based on this table that displays new job attributes and old job attributes on the same row. For example, the original table would be like:
uniqueID  personID  startdate  enddate  title    functions
1         A1        1/1/21     12/1/21  Analyst  data science
2         A1        1/1/22     12/1/22  Manager  admin
The new table would be something like this:
uniqueID  personID  new_title  new_function  old_title  old_function
1         A1        Analyst    data science  nan        nan
2         A1        Manager    admin         Analyst    data science
I tried to use some groupby variations but haven't been able to get this result.

If I understand correctly, you're looking for a shift:
cols = ['title', 'functions']
df[['old_' + c for c in cols]] = df.groupby('personID')[cols].shift(1)
df = df.drop(['startdate', 'enddate'], axis=1).rename({c: 'new_' + c for c in cols}, axis=1)
Output:
>>> df
uniqueID personID new_title new_functions old_title old_functions
0 1 A1 Analyst data science NaN NaN
1 2 A1 Manager admin Analyst data science
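For completeness, here is a self-contained sketch of the same idea, assuming the sample data from the question (dates kept as strings for simplicity):
import pandas as pd

df = pd.DataFrame({
    'uniqueID': [1, 2],
    'personID': ['A1', 'A1'],
    'startdate': ['1/1/21', '1/1/22'],
    'enddate': ['12/1/21', '12/1/22'],
    'title': ['Analyst', 'Manager'],
    'functions': ['data science', 'admin'],
})

cols = ['title', 'functions']
# The previous job's attributes are each person's rows shifted down by one.
df[['old_' + c for c in cols]] = df.groupby('personID')[cols].shift(1)
# Drop the date columns and rename the current attributes to new_*.
df = df.drop(['startdate', 'enddate'], axis=1).rename({c: 'new_' + c for c in cols}, axis=1)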

Related

Multiple similar columns with similar values

The dataframe looks like:
name education education_2 education_3
name_1 NaN some college NaN
name_2 NaN NaN graduate degree
name_3 high school NaN NaN
I just want to keep one education column. I tried using conditional statements to compare the columns to each other, but I got nothing but errors. I also looked into a merge solution, but in vain. Does anyone know how to deal with this using Python or pandas? Thank you in advance.
name education
name_1 some college
name_2 graduate degree
name_3 high school
One day I hope pandas will have better functions for string-typed rows, rather than the limited support for columns that is currently available:
df['education'] = (df.filter(like='education')  # filters to only the education columns
                     .T                         # transposes
                     .convert_dtypes()          # converts to pandas dtypes, still somewhat in beta
                     .max()                     # gets the max value from each column, which will be the non-null one
                  )
df = df[['name', 'education']]
print(df)
Output:
name education
0 name 1 some college
1 name 2 graduate degree
2 name 3 high school
Looping this wouldn't be too hard e.g.:
cols = ['education', 'age', 'income']
for col in cols:
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
You can use df.fillna to do so.
df['combine'] = df[['education','education2','education3']].fillna('').sum(axis=1)
df
name education education2 education3 combine
0 name1 NaN some college NaN some college
1 name2 NaN NaN graduate degree graduate degree
2 name3 high school NaN NaN high school
If you have a lot of columns to combine, you can try this.
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)
Use bfill to fill the empty (NaN) values:
df.bfill(axis=1).drop(columns=['education 2','education 3'])
name education
0 name 1 some college
1 name 2 graduate degree
2 name 3 high school
If there are other columns in between, select only the columns to which you apply the bfill. In essence, if you have multiple education columns that you need to consolidate under a single column, choose just those columns, apply the bfill, and subsequently delete the columns from which you back-filled.
df[['education','education 2','education 3']].bfill(axis=1).drop(columns=['education 2','education 3'])
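If the extra columns follow a common naming pattern, a small sketch like this avoids listing them by hand (it assumes a base column literally named 'education' exists and that the extras all contain 'education' in their names, as in the question):
edu_cols = df.filter(like='education').columns            # every column whose name contains 'education'
df['education'] = df[edu_cols].bfill(axis=1).iloc[:, 0]   # first non-null value per row
df = df.drop(columns=edu_cols.drop('education'))          # drop the now-redundant education_* columns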

Extract specific words from dataframe

I have the following dataframe named marketing, from which I would like to extract the value after source= in each row. Is there a way to create a general regex function that I can apply to other columns as well, to extract the words after the equals sign?
Data
source=book,social_media=facebook,ads=Facebook
source=book,ads=Facebook,customer=2
cost=2, customer=3
I'm using Python and I have tried the following:
df = pd.DataFrame()
def find_keywords(row_string):
    tags = [x for x in row_string if x.startswith('source=')]
    return tags
df['Data'] = marketing['Data'].apply(lambda row: find_keywords(row))
May I know whether there is a more efficient way to extract the values and place them into columns:
source social_media ads customer costs
book facebook facebook - -
book - facebook 2 -
You can split each string value into a dict, then use pd.json_normalize to convert the dicts to columns.
out = pd.json_normalize(marketing['Data'].apply(lambda x: dict([map(str.strip, i.split('=')) for i in x.split(',')]))).dropna(subset='source')
print(out)
source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
Here's another option:
Sample dataframe marketing is:
marketing = pd.DataFrame(
    {"Data": ["source=book,social_media=facebook,ads=Facebook",
              "source=book,ads=Facebook,customer=2",
              "cost=2, customer=3"]}
)
Data
0 source=book,social_media=facebook,ads=Facebook
1 source=book,ads=Facebook,customer=2
2 cost=2, customer=3
Now this
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
.str.split(r"\s*=\s*", expand=True).pivot(columns=0))
does produce
1
0 ads cost customer social_media source
0 Facebook NaN NaN facebook book
1 Facebook NaN 2 NaN book
2 NaN 2 3 NaN NaN
which is almost what you're looking for, except for the extra column level and the column ordering. So the following modification
result = (marketing["Data"].str.split(r"\s*,\s*").explode().str.strip()
.str.split(r"\s*=\s*", expand=True).rename(columns={0: "columns"})
.pivot(columns="columns").droplevel(level=0, axis=1))
result = result[["source", "social_media", "ads", "customer", "cost"]]
should fix that:
columns source social_media ads customer cost
0 book facebook Facebook NaN NaN
1 book NaN Facebook 2 NaN
2 NaN NaN NaN 3 2
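If you specifically want the general regex function asked about in the question, a minimal sketch could look like the following; extract_after_equals is a hypothetical helper name, and the pattern assumes comma-separated key=value pairs as in the sample data:
import pandas as pd

def extract_after_equals(series, key):
    # Capture everything after "key=" up to the next comma (or the end of the string).
    # (?:^|,)\s* anchors the key to the start of the string or to a preceding comma.
    return series.str.extract(rf"(?:^|,)\s*{key}\s*=\s*([^,]+)")[0]

for col in ['source', 'social_media', 'ads', 'customer', 'cost']:
    marketing[col] = extract_after_equals(marketing['Data'], col)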

How to collapse all rows in pandas dataframe across all columns

I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
name  job       value
bob   business  100
NaN   dentist   NaN
jack  NaN       NaN
I am trying to get the following output:
name      job               value
bob jack  business dentist  100
I am trying to group across all columns; I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0
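As a small follow-up: the 100 prints as 100.0 because the NaNs force the value column to float. A sketch that keeps it as an integer (assuming the frame from the question) is to convert to nullable dtypes first:
out = (df.convert_dtypes()  # value becomes a nullable integer, so 100 stays 100
         .apply(lambda x: ' '.join(x.dropna().astype(str)))
         .to_frame().T)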

Removing the rows from dataframe till the actual column names are found

I am reading tabular data from an email into a pandas dataframe.
There is no guarantee that the column names will be in the first row. Sometimes the data is in the following format. The actual column names are [ID, Name, Year]:
dummy1 dummy2 dummy3
test_column1 test_column2 test_column3
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email, how do I remove the initial rows that don't contain the column names? So in the first case I would need to remove the first 2 rows of the dataframe (including the column row), and in the second case I wouldn't have to remove anything.
Also, the column names can be in any sequence.
Basically, I want to do the following:
1. Check whether one of the column names is contained in one of the rows of the dataframe.
2. Remove the rows above it.
if "ID" in row:
    remove the above rows
How can I achieve this?
You can first get the index of the row that contains the valid column names, then set the columns and filter accordingly.
df = pd.read_csv("d.csv",sep='\s+', header=None)
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :] # filter data
df
ID Name Year
3 1 John Sophomore
4 2 Lisa Junior
5 3 Ed Senior
Or, if you want to set ID as the index:
df = df.iloc[col_index + 1 :].set_index('ID')
df
Name Year
ID
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Ugly but effective quick try:
id_name = df.columns[0]
df_clean = df[(df[id_name] == 'ID') | (df[id_name].dtype == 'int64')]
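Since the question says the column names can be in any sequence, here is a hedged sketch that looks for the first row containing exactly the expected names in any order, rather than matching the exact sequence; expected_cols is an assumed set:
expected_cols = {"ID", "Name", "Year"}

# Position of the first row whose values are exactly the expected names, in any order.
header_pos = next(i for i, (_, row) in enumerate(df.iterrows())
                  if set(row.astype(str)) == expected_cols)

df.columns = df.iloc[header_pos]     # promote that row to the header
df = df.iloc[header_pos + 1:]        # keep only the data rows below it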

Get value of 'cell' in one dataframe based on another dataframe using pandas conditions

Given DF1:
Title   | Origin     | %
Analyst | Referral   | 3
Analyst | University | 10
Manager | University | 1
and DF2:
Title   | Referral | University
Analyst |          |
Manager |          |
I'm trying to set the values inside DF2 based on conditions such as:
DF2['Referral'] = np.where((DF1['Title']=='Analyst') & (DF1['Origin']=='Referral')), DF1['%'], '0'
What I'm getting as a result is all the values in DF1['%'], and I'm expecting to get only the value in the row where the conditions are met.
Like this:
Title   | Referral | University
Analyst | 3        | 10
Manager |          | 1
Also, there is probably a more efficient way of doing this, I'm open to suggestions!
Just use pivot, no need for conditional logic:
from io import StringIO
s = """Title|Origin|%
Analyst|Referral|3
Analyst|University|10
Manager|University|1"""
df = pd.read_csv(StringIO(s), sep='|')
df.pivot('Title', 'Origin', '%')
Origin Referral University
Title
Analyst 3.0 10.0
Manager NaN 1.0
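As a usage note, newer pandas versions require keyword arguments for pivot, and if you want 0 instead of NaN where a combination is missing, a sketch of the same idea:
out = (df.pivot(index='Title', columns='Origin', values='%')
         .fillna(0)
         .reset_index())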
