Shifting values of a column, grouping by another column of the DataFrame - python

I have a dataFrame that looks like the following:
page_id  content         name
1        {}              John
1        {cat, dog}      Anne
2        {}              Ethan
3        {}              John
3        {sea, earth}    Anne
3        {earth, green}  Ethan
4        {}              Mark
I need the value of the content column of each row to be equal to the content value of the next row, but only within the same page_id. I suppose I need to use the shift() function along with a groupby on page_id, but I don't know how to put them together.
The expected output would be:
page_id  content         name
1        {cat, dog}      John
1        NaN             Anne
2        NaN             Ethan
3        {sea, earth}    John
3        {earth, green}  Anne
3        NaN             Ethan
4        NaN             Mark
Any help on this issue would be much appreciated.

Looks like you want a groupby with shift:
df['content'] = df.groupby('page_id').content.apply(lambda x: x.shift(-1))
   page_id         content
0        1      {cat, dog}
1        1             NaN
2        2             NaN
3        3    {earth, sea}
4        3  {green, earth}
5        3             NaN
6        4             NaN
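As a side note, the lambda is not needed: GroupBy.shift does the same thing directly and avoids the apply overhead.

df['content'] = df.groupby('page_id')['content'].shift(-1)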

You can avoid the groupby-apply given your sorting on 'page_id': shift everything, then keep values only within each group using where. This will be much faster as the number of groups becomes large.
df['content'] = df.content.shift(-1).where(df.page_id.eq(df.page_id.shift(-1)))
page_id content name
0 1 {cat, dog} John
1 1 NaN Anne
2 2 NaN Ethan
3 3 {earth, sea} John
4 3 {earth, green} Anne
5 3 NaN Ethan
6 4 NaN Mark
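For a quick end-to-end check, here is a minimal sketch reconstructing the frame from the question (treating the {} entries as empty Python sets, which is an assumption) and applying the one-liner:

import pandas as pd

df = pd.DataFrame({
    'page_id': [1, 1, 2, 3, 3, 3, 4],
    'content': [set(), {'cat', 'dog'}, set(), set(),
                {'sea', 'earth'}, {'earth', 'green'}, set()],
    'name': ['John', 'Anne', 'Ethan', 'John', 'Anne', 'Ethan', 'Mark'],
})

# Shift content up one row, then keep the shifted value only where the
# next row has the same page_id; everything else becomes NaN.
df['content'] = df.content.shift(-1).where(df.page_id.eq(df.page_id.shift(-1)))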

Related

How to create a new column based on an average with certain conditions, ignoring nulls, in a python dataframe?

I have 2 tables:

date        James  Jamie  John  Allysia  Jean
2022-01-01  NaN    6      5     4        3
2022-01-02  7      6      7     NaN      5

names    groupings
James    guy
John     guy
Jamie    girl
Allysia  girl
Jean     girl

into

date        James  Jamie  John  Allysia  Jean  girl  guy
2022-01-01  NaN    6      5     4        3     5     5
2022-01-02  7      6      7     NaN      5     5.5   7

threshold: > 3
I want to create new columns, grouped by guy/girl, holding the mean of each group's scores, where only scores above the threshold are taken and NaN is ignored.
I do not know how to replace scores below the threshold with NaN.
I tried doing a groupby to get the names into lists and create a new column with the mean:
groupingseries = groupings.groupby(['groupings'])['names'].apply(list)
for k, s in zip(groupingseries.keys(), groupingseries):
    try:
        its = '"' + ',"'.join(s) + '"'
        df[k] = df[s].mean()
    except:
        print('not in item')
Not sure why the results return NaN for girl and guy.
Please do help.
Assuming df and groupings are your two input DataFrames:
out = df.join(
    df.groupby(df.columns.map(groupings.set_index('names')['groupings']), axis=1)
      .sum()
)
Output:
date James Jamie John Allysia Jean girl guy
0 2022-01-01 NaN 6 5 4.0 3 13.0 5.0
1 2022-01-02 7.0 6 7 NaN 5 11.0 14.0
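For what it's worth, the NaN in your loop comes from index alignment: df[s].mean() is a column-wise mean indexed by the names, and assigning that Series to df[k] aligns on the row index of df, where nothing matches. Also note the join above sums without applying the threshold; a minimal sketch that produces the asked-for means (masking scores at or below 3 first; column names as in the question) could look like this:

import pandas as pd

mapping = groupings.set_index('names')['groupings']  # name -> girl/guy

scores = df.set_index('date')
masked = scores.where(scores > 3)  # scores at or below the threshold become NaN

# Transpose so the names form the index, group them into girl/guy, take
# the mean (NaN is skipped by default), then transpose back and join.
means = masked.T.groupby(mapping).mean().T
out = df.join(means, on='date')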

How to replace columns based on another column (matching parent id with id) python

I have a dataset that looks like this
ID  TITLE  PARENT_ID
1   Tom    NaN
2   Lisa   1
3   Lecy   1
4   Ann    NaN
5   John   NaN
6   Lana   4
If Lisa's PARENT_ID is 1, then her parent is Tom. If Tom has a NaN PARENT_ID, it means he is not a child; he is a parent. So I need it to look like this:
ID  TITLE  PARENT_ID
1   Tom    NaN
2   Lisa   Tom
3   Lecy   Tom
4   Ann    NaN
5   John   NaN
6   Lana   Ann
You can use map (note it has to be applied to PARENT_ID, not ID):
df['PARENT_ID_FINAL'] = df['PARENT_ID'].map(dict(zip(df['ID'], df['TITLE'])))
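A quick sanity check of that line, with the data reconstructed from the question (the PARENT_ID_FINAL name is just illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'TITLE': ['Tom', 'Lisa', 'Lecy', 'Ann', 'John', 'Lana'],
    'PARENT_ID': [np.nan, 1, 1, np.nan, np.nan, 4],
})

# Look up each PARENT_ID in an ID -> TITLE mapping; NaN parent ids stay
# NaN because they have no entry in the mapping.
df['PARENT_ID_FINAL'] = df['PARENT_ID'].map(dict(zip(df['ID'], df['TITLE'])))
print(df)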

Turn 4 columns into two

In a Jupyter notebook, using pandas, I have a CSV with 4 columns.
Names  Number  Names2  Number2
Jim    2       Greg    5
Meek   4       Drake   6
NaN    12      Tim     3
Neri   1       Nan     9
There are no duplicates between the two Name columns, but there are NaN's.
I am looking to:
Create 2 new columns that append the 4 columns
Remove the NaN's in the process
Where there are NaN names, remove the associated number as well.
Desired Output
Names  Number  Names2  Number2  -  NameList  NumberList
Jim    2       Greg    5           Jim       2
Meek   4       Drake   6           Meek      4
NaN    12      Tim     3           Neri      1
Neri   1       Nan     9           Greg      5
                                   Drake     6
                                   Tim       3
I have tried using .append, but whenever I append, my new NameList column ends up the same length as one of the original columns, or the NaN's stay.
This looks like pd.wide_to_long with a little modification on the first set of Names and Number columns:
d = dict(zip(['Names', 'Number'], ['Names1', 'Number1']))

(pd.wide_to_long(df.rename(columns=d).reset_index(),
                 ['Names', 'Number'], 'index', 'v')
   .dropna(subset=['Names'])
   .reset_index(drop=True))
Names Number
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
You can try this:
df = df.replace('Nan', np.nan)
df1 = (pd.concat([pd.concat([df['Names'], df['Names2']]),
                  pd.concat([df['Number'], df['Number2']])], axis=1)
         .dropna()
         .rename(columns={0: 'Nameslist', 1: 'Numberlist'})
         .reset_index(drop=True))
print(df1)
Nameslist Numberlist
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
When you want to concatenate while ignoring the column names and index, numpy can be a handy tool:
# Assumes the literal 'Nan' strings were already replaced with real NaN
# (e.g. df = df.replace('Nan', np.nan)) so that dropna removes those rows.
tmp = pd.DataFrame(np.concatenate([df[['Names', 'Number']].dropna().values,
                                   df[['Names2', 'Number2']].dropna().values]),
                   columns=['NameList', 'NumberList'])
It gives:
NameList NumberList
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
You can now concatenate on axis=1:
pd.concat([df, tmp], axis=1)
which gives as expected:
Names Number Names2 Number2 NameList NumberList
0 Jim 2.0 Greg 5.0 Jim 2
1 Meek 4.0 Drake 6.0 Meek 4
2 NaN 12.0 Tim 3.0 Neri 1
3 Neri 1.0 NaN 9.0 Greg 5
4 NaN NaN NaN NaN Drake 6
5 NaN NaN NaN NaN Tim 3
Try this:
(pd.concat([df,
            pd.DataFrame({x.replace("2", ""): df.pop(x)
                          for x in ['Names2', 'Number2']})])
   .replace('Nan', np.nan)
   .dropna())
Output:
Names Number
0 Jim 2
1 Meek 4
3 Neri 1
0 Greg 5
1 Drake 6
2 Tim 3

How to merge only a specific data frame column in pandas?

I've been trying to use the pd.merge function properly, but I either receive an error or get the table formatted in a way I don't like. I looked through the documentation, but I can't find a way to merge only a specific column. For instance, let's say I'm working with these two dataframes.
df_1 = county_name  accidents  pedestrians
       ADAMS        1          2
       ALLEGHENY    1          3
       ARMSTRONG    3          4
       BEDFORD      1          1

df_2 = county_name  population
       ADAMS        102336
       ALLEGHENY    1223048
       ARMSTRONG    65642
       BEDFORD      166140
       BERKS        48480
       BLAIR        417854
       BRADFORD     123457
       BUCKS        60853
       CAMBRIA      628341
The outcome I'm looking for is something like this, where the county names are added to the 'county_name' column but not duplicated, and the 'population' column is left off.
df_outcome = county_name  accidents  pedestrians
             ADAMS        1          2
             ALLEGHENY    1          3
             ARMSTRONG    3          4
             BEDFORD      1          1
             BERKS        NaN        NaN
             BLAIR        NaN        NaN
             BRADFORD     NaN        NaN
             BUCKS        NaN        NaN
             CAMBRIA      NaN        NaN
Lastly, I plan to use df_outcome.fillna(0) to replace all the NaN values with zero.
Filter column county_name and use merge with left join:
df = df_2[['county_name']].merge(df_1, how='left')
print (df)
county_name accidents pedestrians
0 ADAMS 1.0 2.0
1 ALLEGHENY 1.0 3.0
2 ARMSTRONG 3.0 4.0
3 BEDFORD 1.0 1.0
4 BERKS NaN NaN
5 BLAIR NaN NaN
6 BRADFORD NaN NaN
7 BUCKS NaN NaN
8 CAMBRIA NaN NaN
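As planned in the question, the remaining NaN values can then be zeroed out:

df_outcome = df.fillna(0)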
Or try a right join, so every county in df_2 is kept:
df = pd.merge(df_1, df_2[['county_name']], how='right')

Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries

I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns, 'City, 'State' and 'Country'.
0 HUN
1 ESP
2 GBR
3 ESP
4 FRA
5 ID, USA
6 GA, USA
7 Hoboken, NJ, USA
8 NJ, USA
9 AUS
Splitting the column into three columns is trivial enough:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(',')))
However, this creates left-aligned data:
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 ID USA NaN
6 GA USA NaN
7 Hoboken NJ USA
8 NJ USA NaN
9 AUS NaN NaN
How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?
I'd do something like the following:
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print(rev)
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 USA ID NaN
6 USA GA NaN
7 USA NJ Hoboken
8 USA NJ NaN
9 AUS NaN NaN
I think that gets you what you want, but if you also want to pretty things up and get a City, State, Country column order, you could add the following:
rev.rename(columns={0: 'Country', 1: 'State', 2: 'City'}, inplace=True)
rev = rev[['City', 'State', 'Country']]
print(rev)
City State Country
0 NaN NaN HUN
1 NaN NaN ESP
2 NaN NaN GBR
3 NaN NaN ESP
4 NaN NaN FRA
5 NaN ID USA
6 NaN GA USA
7 Hoboken NJ USA
8 NaN NJ USA
9 NaN NaN AUS
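One caveat with this approach: a plain split(',') keeps the leading space on parts like ' USA'. A small amendment to the lambda (not in the original answer) strips each piece first:

foo = lambda x: pd.Series(list(reversed([p.strip() for p in x.split(',')])))
rev = df['City, State, Country'].apply(foo)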
Assuming your column is named target:
df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True)
Since you are dealing with strings, I would suggest this amendment to your current code:
location_df = df[['City, State, Country']].apply(lambda x: pd.Series(str(x).split(',')))
I got mine to work by testing one of the columns, but give this one a try.
