Transforming a single-column dataframe into table form - python

I have a dataframe with four records:
Name: Bob
College Name:Boston
Name:Ready
College Name:IIT KGP
I want to transform that into a table with two columns in python, like:
Name College
Bob Boston
Ready IIT KGP
The separator should be ":".

First split the values of the column on the first ':', add a counter with cumcount, and reshape with unstack:
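For reference, a minimal sketch of the input, assuming the raw records live in a single column named 'col' (the name the code below uses):
import pandas as pd

# hypothetical single-column input built from the records in the question
df = pd.DataFrame({'col': ['Name: Bob', 'College Name:Boston',
                           'Name:Ready', 'College Name:IIT KGP']})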
df = df['col'].str.split(':', expand=True, n=1)
df.columns = ['a','b']
df1 = (df.set_index(['a',df.groupby('a').cumcount()])['b']
.unstack()
.T
.rename_axis(None, axis=1)
.reindex(columns=df['a'].unique()))
print (df1)
Name College Name
0 Bob Boston
1 Ready IIT KGP
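An equivalent sketch using pivot instead of set_index/unstack (same idea: the cumcount gives each record's row position within its key):
df1 = (df.assign(g=df.groupby('a').cumcount())
         .pivot(index='g', columns='a', values='b')
         .rename_axis(None, axis=1)
         .reindex(columns=df['a'].unique())
         .reset_index(drop=True))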

How to write vectorized functions that pull arguments from two dataframes of different size

I am putting together a new formatted dataframe that aggregates data from a different dataframe. I need to create a column in this new dataframe that filters and aggregates data from a secondary dataframe. I wrote a function to do so: it filters the second dataframe based on the new column's title and the values from each row of another column in the new dataframe, then sums the values of a column in the secondary dataframe.
As an example:
df2 = pd.DataFrame({'name':['alan','sky','liam','liam','alan','liam','alan','sky','bryan','alan','sky']
,'age': [1,5,10,15,20,25,30,35,40,45,50]
,'values': [564,65,4,44,8,60,4,684,51,3,14]})
df1 = pd.DataFrame({'name':['alan','sky','liam','bryan']})
def get_cumsum_values(person, data, col):
    # row-wise filter: keep this person's rows below the age cutoff, then sum their values
    mask = data.apply(lambda x: x['age'] < col and x['name'] == person, axis=1)
    return data.loc[mask, 'values'].sum()
df1['10'] = df1.apply(lambda x: get_cumsum_values(person=x['name'], data=df2, col=10), axis=1)
I'm dealing with a ton of data, and this code takes forever. The culprit seems to be the apply method at the end that creates the new column. Is there a way to use vectorization to get this done?
Why don't you do something like the following (without any .applys):
def get_cumsum_values(names, age, data):
return (
data[data.name.isin(names) & (data.age < age)]
.groupby("name")["values"]
.sum()
.rename(str(age))
.reset_index(drop=False)
)
df1 = df1.merge(
get_cumsum_values(df1.name.unique(), 10, df2), on="name", how="left"
)
Result (liam and bryan get NaN because they have no rows with age below 10):
name 10
0 alan 564.0
1 sky 65.0
2 liam NaN
3 bryan NaN
Or, set the name column of df1 as the index, and then do:
df1 = df1.set_index("name")
def get_cumsum_values(names, age, data):
return (
data[data.name.isin(names) & (data.age < age)]
.groupby("name")["values"]
.sum()
)
df1['10'] = get_cumsum_values(df1.index.unique(), 10, df2)
df1['35'] = get_cumsum_values(df1.index.unique(), 35, df2)
Result:
10 35
name
alan 564.0 576.0
sky 65.0 65.0
liam NaN 108.0
bryan NaN NaN
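As a usage sketch, the indexed variant extends to any list of cutoffs (the thresholds below are illustrative):
for age in (10, 35, 50):
    df1[str(age)] = get_cumsum_values(df1.index.unique(), age, df2)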

Generating a dataframe based off the diff between two dataframes

I have 2 data frames that look like this
Df1
City Code ColA Col..Z
LA LAA
LA LAB
LA LAC
Df2
Code ColA Col..Z
LA LAA
NY NYA
CH CH1
What I'm trying to do is get the result of
df3
Code ColA Col..Z
NY NYA
CH CH1
Normally I would loop through each row in df2 and say:
Df3 = If df2.row['Code'] in df1 then drop it.
But I want to find a pythonic pandas way to do it instead of looping through the dataframe. I was looking at examples using joins or merging but I can't seem to work it out.
This Df3 = If df2.row['Code'] in df1 then drop it. translates to
df3 = df2[~df2['Code'].isin(df1['City'])]
(note that, per the sample frames, the values to match against live in df1's City column).
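Since the question mentions joins and merging, here is an equivalent left anti-join sketch using merge's indicator flag (the Code-to-City key pairing follows the sample frames above):
df3 = (
    df2.merge(df1[['City']].drop_duplicates(),
              left_on='Code', right_on='City',
              how='left', indicator=True)
       .query("_merge == 'left_only'")
       .drop(columns=['City', '_merge'])
)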
To keep only the different items in df2 based on the key column, you can do something like this, using drop_duplicates:
df2[df2['Code'].isin(
    # values that appear only once across df1's 'City' and df2's 'Code'
    pd.concat([df1['City'], df2['Code']]).drop_duplicates(keep=False)
)]
There is a pandas DataFrame.compare method which might be relevant (note that it can only compare identically-labeled DataFrames, i.e. same shape and same row/column labels):
df1 = pd.read_clipboard()
df1
df2 = pd.read_clipboard()
df2
df1.compare(df2).drop('self', axis=1, level=1).droplevel(1, axis=1)
(And I'm making an assumption you had a typo in your dataframes with the City col missing from df2?)

reshaping dataframe two columns into one column and two rows

Suppose I have the following df that I would like to reshape:
df6 = pd.DataFrame({'name':['Sara', 'John', 'Jack'],
'trip places': ['UK,UK,UK', 'US,US,US', 'AUS,AUS,AUS'],
'Trip code': ['UK322,UK454,UK4441', 'US664,US4544,US44', 'AUS11,AUS11,AUS11']
})
df6
Looks like:
name trip places Trip code
0 Sara UK,UK,UK UK322,UK454,UK4441
1 John US,US,US US664,US4544,US44
2 Jack AUS,AUS,AUS AUS11,AUS11,AUS11
I want to add a new column, let's say df6['total-info'], and merge the current two columns trip places and Trip code into two rows per name, so the output will be like this:
name total-info
0 Sara UK,UK,UK
UK322,UK454,UK4441
1 John US,US,US
US664,US4544,US44
2 Jack AUS,AUS,AUS
AUS11,AUS11,AUS11
I tried many methods (grouping, stack/unstack, pivot, etc.), but nothing I tried generates the output I need, and I am not completely familiar with the best function for this. I also tried concatenation, but it produced a single column with both columns' comma-separated values lumped together.
Use set_index, stack, droplevel then reset_index and specify the new column name:
df7 = (
df6
.set_index('name') # Preserve during reshaping
.stack() # Reshape
.droplevel(1) # Remove column names
.reset_index(name='total-info') # reset_index and name new column
)
df7:
name total-info
0 Sara UK,UK,UK
1 Sara UK322,UK454,UK4441
2 John US,US,US
3 John US664,US4544,US44
4 Jack AUS,AUS,AUS
5 Jack AUS11,AUS11,AUS11
Or, if name is to be part of the MultiIndex, append name and call to_frame
after stack and droplevel instead:
df7 = (
df6
.set_index('name', append=True) # Preserve during reshaping
.stack() # Reshape
.droplevel(2) # Remove column names
.to_frame(name='total-info') # Make DataFrame and name new column
)
total-info
name
0 Sara UK,UK,UK
Sara UK322,UK454,UK4441
1 John US,US,US
John US664,US4544,US44
2 Jack AUS,AUS,AUS
Jack AUS11,AUS11,AUS11
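An alternative sketch with melt produces the same name/value pairs, though the rows come out grouped by source column rather than interleaved per name:
df7 = (
    df6.melt(id_vars='name', value_name='total-info')
       .drop(columns='variable')
)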

Copy values only to a new empty dataframe with column names - Pandas

I have two dataframes.
df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df2 = pd.DataFrame(columns=['pid','gen','ethn'])
As you can see, the second dataframe (df2) is empty, but it may also contain a few rows of data at times.
What I would like to do is copy the values (only) from df1 to df2, with the column names of df2 remaining unchanged.
I tried the below, but neither worked:
df2 = df1.copy(deep=False)
df2 = df1.copy(deep=True)
How can I achieve this output? Note that I don't want the column names of df1, only the data.
Do:
df1.columns = df2.columns.tolist()
df2 = pd.concat([df2, df1])
(Note: the older df2.append(df1) works the same way on pandas < 2.0, but DataFrame.append was removed in pandas 2.0, so prefer pd.concat.)
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
Edit, based on the OP's comment describing the nature of the dataframes:
df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethn': ['Chinese','Indian','European']})
df2= pd.DataFrame({'pers_id':[4,5,6],'gen': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df3= pd.DataFrame({'son_id':[7,8,9],'sex': ['Male','Female','Not disclosed'],'ethnici': ['Chinese','Indian','European']})
final_df = pd.DataFrame(columns=['pid','gen','ethn'])
Now do:
frames = [df1, df2, df3]
for f in frames:
    f.columns = final_df.columns.tolist()  # relabel to the target schema
final_df = pd.concat([final_df, *frames])
print(final_df)
Output:
pid gen ethn
0 1 Male Chinese
1 2 Female Indian
2 3 Not disclosed European
0 4 Male Chinese
1 5 Female Indian
2 6 Not disclosed European
0 7 Male Chinese
1 8 Female Indian
2 9 Not disclosed European
The cleanest solution, I think, is to just concatenate df1 after its column names have been set properly:
df2 = pd.concat([df2, pd.DataFrame(df1.values, columns=df2.columns)])
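A similar one-liner sketch uses set_axis, which returns a copy of df1 with relabeled columns, and also works when df2 already holds rows:
df2 = pd.concat([df2, df1.set_axis(df2.columns, axis=1)], ignore_index=True)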

Pandas GroupBy query

I have a dataframe in pandas which looks like the following:
[Snapshot of my pandas dataframe]
Now I want the dataframe to be transformed as below, where the 'category' values are concatenated with a delimiter for each customerid, based on the sorted date value (%m/%d/%Y). The order with the earlier date has its category listed first for the corresponding customer id.
[Desired/transformed data frame]
First convert the Date column with to_datetime, then sort_values, and finally groupby with join:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df = (df.sort_values(['customerid','Age','Date'])
.groupby(['customerid','Age'])['category']
.agg(', '.join)
.reset_index())
print (df)
customerid Age category
0 1 10 Electronics, Clothing
1 2 25 Grocery, Clothing
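Since the frames are only shown as screenshots, here is a hypothetical input consistent with the output above (all values, including the dates, are made up):
import pandas as pd

df = pd.DataFrame({
    'customerid': [1, 1, 2, 2],
    'Age': [10, 10, 25, 25],
    'Date': ['01/05/2020', '02/10/2020', '01/15/2020', '03/01/2020'],  # hypothetical dates
    'category': ['Electronics', 'Clothing', 'Grocery', 'Clothing'],
})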
