Turn 4 columns into two

In a Jupyter notebook, using pandas, I have a CSV with 4 columns:
Names Number Names2 Number2
Jim 2 Greg 5
Meek 4 Drake 6
NaN 12 Tim 3
Neri 1 Nan 9
There are no duplicates between the two Name columns, but there are NaNs.
I am looking to:
Create 2 new columns that append the 4 columns
Remove the NaNs in the process
Where a name is NaN, remove the associated number as well
Desired output:
Names  Number  Names2  Number2  -  NameList  NumberList
Jim    2       Greg    5           Jim       2
Meek   4       Drake   6           Meek      4
NaN    12      Tim     3           Neri      1
Neri   1       Nan     9           Greg      5
                                   Drake     6
                                   Tim       3
I have tried using .append, but whenever I append, my new NameList column ends up being the same length as one of the original columns, or the NaNs stay.

This looks like a job for pd.wide_to_long, after a small modification to the first pair of columns (Names and Number) so that they carry a numeric suffix like the second pair:
d = dict(zip(['Names','Number'],['Names1','Number1']))
(pd.wide_to_long(df.rename(columns=d).reset_index()
,['Names','Number'],'index','v')
.dropna(subset=['Names']).reset_index(drop=True))
Names Number
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
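
The snippet above assumes df already exists. As a self-contained sketch, the frame below is rebuilt by hand from the question (storing the missing names as real NaN is an assumption about how the CSV loads), and the same wide_to_long call is run end to end:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; missing names are real NaN here (assumption).
df = pd.DataFrame({
    "Names": ["Jim", "Meek", np.nan, "Neri"],
    "Number": [2, 4, 12, 1],
    "Names2": ["Greg", "Drake", "Tim", np.nan],
    "Number2": [5, 6, 3, 9],
})

# Give the first pair a numeric suffix so both pairs share the stubs
# "Names" and "Number".
renamed = df.rename(columns={"Names": "Names1", "Number": "Number1"})

long_df = (
    pd.wide_to_long(renamed.reset_index(), stubnames=["Names", "Number"],
                    i="index", j="v")
    .dropna(subset=["Names"])      # drop rows whose name is missing
    .reset_index(drop=True)        # discard the (index, v) MultiIndex
)
print(long_df)
```

Dropping on subset=["Names"] removes a row only when the name is missing, so the orphaned numbers 12 and 9 go away with their NaN names.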

You can try this:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = df.replace('Nan', np.nan)
df1 = (pd.concat([pd.concat([df['Names'], df['Names2']]),
                  pd.concat([df['Number'], df['Number2']])], axis=1)
       .dropna()
       .rename(columns={0: 'Nameslist', 1: 'Numberlist'})
       .reset_index(drop=True))
print(df1)
Nameslist Numberlist
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3

When you want to concatenate while ignoring the column names and index, numpy can be a handy tool:
tmp = pd.DataFrame(np.concatenate(
[df[['Names', 'Number']].dropna().values,
df[['Names2', 'Number2']].dropna().values]),
columns=['NameList', 'NumberList'])
It gives:
NameList NumberList
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
You can now concatenate on axis=1:
pd.concat([df, tmp], axis=1)
which gives as expected:
Names Number Names2 Number2 NameList NumberList
0 Jim 2.0 Greg 5.0 Jim 2
1 Meek 4.0 Drake 6.0 Meek 4
2 NaN 12.0 Tim 3.0 Neri 1
3 Neri 1.0 NaN 9.0 Greg 5
4 NaN NaN NaN NaN Drake 6
5 NaN NaN NaN NaN Tim 3

Try this:
# Note: df.pop removes Names2 and Number2 from df in place.
(pd.concat([df,
            pd.DataFrame(
                {x.replace("2", ""): df.pop(x)
                 for x in ['Names2', 'Number2']})])
 .replace('Nan', np.nan).dropna())
Output:
Names Number
0 Jim 2
1 Meek 4
3 Neri 1
0 Greg 5
1 Drake 6
2 Tim 3
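
The output above keeps the original row labels (0, 1, 3, 0, 1, 2), and the pop-based version modifies df as a side effect. A minimal sketch of a non-mutating variant with a clean index follows. The sample frame is rebuilt from the question, and treating "Nan" in Names2 as a literal string is an assumption taken from the replace call above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Jim", "Meek", np.nan, "Neri"],
    "Number": [2, 4, 12, 1],
    "Names2": ["Greg", "Drake", "Tim", "Nan"],  # literal string "Nan" (assumed)
    "Number2": [5, 6, 3, 9],
})

# Select each pair without touching df, and align the column names.
first = df[["Names", "Number"]]
second = df[["Names2", "Number2"]].rename(
    columns={"Names2": "Names", "Number2": "Number"})

stacked = (
    pd.concat([first, second], ignore_index=True)  # stack the pairs vertically
    .replace("Nan", np.nan)                        # turn the string into a real NaN
    .dropna(subset=["Names"])                      # drop rows with no name
    .reset_index(drop=True)
)
print(stacked)
```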

Related

Setting subset of a pandas DataFrame by a DataFrame

I feel like this question has been asked a million times before, but I just can't seem to get it to work or find an SO post answering my question.
So I am selecting a subset of a pandas DataFrame and want to change these values individually.
I am subselecting my DataFrame like this:
df.loc[df[key].isnull(), [keys]]
which works perfectly. If I try and set all values to the same value such as
df.loc[df[key].isnull(), [keys]] = 5
it works as well. But if I try to set it to a DataFrame, it does not work, although no error is raised either.
So for example I have a DataFrame:
data = [['Alex',10,0,0,2],['Bob',12,0,0,1],['Clarke',13,0,0,4],['Dennis',64,2],['Jennifer',56,1],['Tom',95,5],['Ellen',42,2],['Heather',31,3]]
df1 = pd.DataFrame(data,columns=['Name','Age','Amount_of_cars','cars_per_year','some_other_value'])
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.0 2.0
1 Bob 12 0 0.0 1.0
2 Clarke 13 0 0.0 4.0
3 Dennis 64 2 NaN NaN
4 Jennifer 56 1 NaN NaN
5 Tom 95 5 NaN NaN
6 Ellen 42 2 NaN NaN
7 Heather 31 3 NaN NaN
and a second DataFrame:
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5],[3/31,7]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
cars_per_year some_other_value
0 0.031250 5
1 0.017857 1
2 0.052632 7
3 0.047619 5
4 0.096774 7
and I would like to replace those NaNs with the values from the second DataFrame:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
Unfortunately this does not work as the index does not match. So how do I ignore the index, when setting values?
Any help would be appreciated. Sorry if this has been posted before.

This is possible only if the number of missing values equals the number of rows in df2. In that case, assign a NumPy array to prevent index alignment:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
If not, you get an error like:
#4 rows assigned to 5 rows
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
ValueError: shape mismatch: value array of shape (4,) could not be broadcast to indexing result of shape (5,)
Another idea is to set the index of df2 to the index of the filtered rows in df1:
df2 = df2.set_index(df1.index[df1['cars_per_year'].isnull()])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0

Just add .values (or .to_numpy() if you are using pandas 0.24 or later):
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
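
Because positional assignment ignores the index entirely, it can help to check the row counts before assigning. The following is a sketch on a reduced version of the frames above. The shortened data and the explicit check are illustrative additions, not part of the original answers:

```python
import numpy as np
import pandas as pd

# Reduced versions of df1 and df2 from the question (illustrative data).
df1 = pd.DataFrame({
    "Name": ["Alex", "Bob", "Dennis", "Tom"],
    "cars_per_year": [0.0, 0.0, np.nan, np.nan],
    "some_other_value": [2.0, 1.0, np.nan, np.nan],
})
df2 = pd.DataFrame({
    "cars_per_year": [2 / 64, 5 / 95],
    "some_other_value": [5, 7],
})

cols = ["cars_per_year", "some_other_value"]
mask = df1["cars_per_year"].isnull()

# Positional assignment only makes sense when the row counts agree.
if mask.sum() != len(df2):
    raise ValueError(f"{mask.sum()} rows to fill, but df2 has {len(df2)} rows")

# to_numpy() strips the index, so the values are written in order.
df1.loc[mask, cols] = df2[cols].to_numpy()
print(df1)
```

Selecting df2[cols] before converting also guarantees that the column order of the array matches the columns being assigned.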

Shifting values of a column grouping by another column of the DataFrame

I have a DataFrame that looks like the following:
page_id content name
1 {} John
1 {cat, dog} Anne
2 {} Ethan
3 {} John
3 {sea, earth} Anne
3 {earth, green} Ethan
4 {} Mark
I need the content value of each row to be equal to the content value of the next row, but only when both rows share the same page_id. I suppose I need to use the shift() function along with a groupby on page_id, but I don't know how to put it together.
The expected output would be:
page_id content name
1 {cat, dog} John
1 NaN Anne
2 NaN Ethan
3 {sea, earth} John
3 {earth, green} Anne
3 NaN Ethan
4 NaN Mark
Any help on this issue would be much appreciated.

Looks like you want a groupby with shift:
# GroupBy.shift keeps the original index, so the result assigns back cleanly
df['content'] = df.groupby('page_id')['content'].shift(-1)
page_id content
0 1.0 {cat, dog}
1 NaN NaN
2 NaN NaN
3 3.0 {earth, sea}
4 3.0 {green, earth}
5 NaN NaN
6 NaN NaN

You can avoid the groupby apply, given that the frame is sorted on 'page_id'. Shift everything, then keep only the values within each group using where. This will be much faster as the number of groups becomes large.
df['content'] = df.content.shift(-1).where(df.page_id.eq(df.page_id.shift(-1)))
page_id content name
0 1 {cat, dog} John
1 1 NaN Anne
2 2 NaN Ethan
3 3 {earth, sea} John
4 3 {earth, green} Anne
5 3 NaN Ethan
6 4 NaN Mark
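
To compare the two approaches side by side, here is a small self-contained sketch. The content values are stored as plain strings, which is an assumption, since the question does not say how they are represented:

```python
import pandas as pd

# Sample frame from the question; content is kept as strings (assumption).
df = pd.DataFrame({
    "page_id": [1, 1, 2, 3, 3, 3, 4],
    "content": ["{}", "{cat, dog}", "{}", "{}",
                "{sea, earth}", "{earth, green}", "{}"],
    "name": ["John", "Anne", "Ethan", "John", "Anne", "Ethan", "Mark"],
})

# Approach 1: shift within each page_id group.
by_group = df.groupby("page_id")["content"].shift(-1)

# Approach 2: shift the whole column, then keep a value only where the
# next row has the same page_id (relies on rows being sorted by page_id).
same_page = df["page_id"].eq(df["page_id"].shift(-1))
by_mask = df["content"].shift(-1).where(same_page)

print(pd.DataFrame({"by_group": by_group, "by_mask": by_mask}))
```

The second approach is only valid when rows of the same page_id are contiguous; the grouped shift does not need that.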

Creating dataframe from another dataframe and list

I have a list of names like this:
names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
and I have a DataFrame like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Barsa
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
I want to create a DataFrame that has all the names in the names list and the corresponding numbers from this DataFrame, grouped. Names that have no corresponding numbers should get an asterisk, like below:
Names Number
Josh 1
Jon 2,9
Adam 3
Barsa 4,6,7
Fekse 5
Bravo *
Talyo 8
Zidane 10
Are there any built-in functions to get this done?

You can use GroupBy with str.join, then reindex with your names list:
res = df.groupby('Names')['Number'].apply(lambda x: ','.join(map(str, x))).to_frame()\
.reindex(names).fillna('*').reset_index()
print(res)
Names Number
0 Josh 1
1 Jon 2,9
2 Adam 3
3 Barsa 4,6,7
4 Fekse 5
5 Bravo *
6 Talyo 8
7 Zidane 10
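
For reference, a self-contained variant of the same idea is sketched below, with the sample data rebuilt from the question. It converts the numbers to strings up front and uses the fill_value argument of reindex instead of a separate fillna; this is just one possible spelling:

```python
import pandas as pd

names = ["Josh", "Jon", "Adam", "Barsa", "Fekse", "Bravo", "Talyo", "Zidane"]
df = pd.DataFrame({
    "Number": range(1, 11),
    "Names": ["Josh", "Jon", "Adam", "Barsa", "Fekse",
              "Barsa", "Barsa", "Talyo", "Jon", "Zidane"],
})

res = (
    df["Number"].astype(str)            # join needs strings
    .groupby(df["Names"])               # group the numbers by name
    .agg(",".join)                      # "4,6,7" style lists
    .reindex(names, fill_value="*")     # order by the list; "*" for absent names
    .rename_axis("Names")
    .reset_index()
)
print(res)
```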

Function to determine window in a rolling function

I have a DataFrame in which I want to apply a rolling mean over a column of numbers in which each value is repeated (usually three times), where I want only 4 unique values to go into the mean.
Let's say my DataFrame looks like:
Group Column to roll
1 9
2 5
2 5
2 4
2 4
2 4
2 3
2 3
2 3
2 6
2 6
2 6
2 8
Since I want 4 unique values to go into the mean but all values to be of equal weight and within the same group, my expected output (assuming I need 4 unique values) would be:
Group Output
1 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (8+6+3+4)/4
Any ideas how to do this?

You could try something like this:
df['Column to roll'].drop_duplicates().rolling(4).mean().reindex(df.index).ffill()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 4.50
9 4.50
10 4.50
11 5.25
Name: Column to roll, dtype: float64
Edit (the question changed):
df_out = df.groupby('Group')['Column to roll']\
.apply(lambda x: x.drop_duplicates().rolling(4).mean()).rename('Output')
df.set_index('Group',append=True).swaplevel(0,1)\
.join(df_out, how='left').ffill().reset_index(level=1, drop=True)
Output:
Column to roll Output
Group
1 9 NaN
2 5 NaN
2 5 NaN
2 4 NaN
2 4 NaN
2 4 NaN
2 3 NaN
2 3 NaN
2 3 NaN
2 6 4.50
2 6 4.50
2 6 4.50
2 8 5.25
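
An alternative sketch that keeps the original index and forward-fills only within each group uses groupby with transform. The helper name rolling_unique_mean is hypothetical, and the sample frame is rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1] + [2] * 12,
    "Column to roll": [9, 5, 5, 4, 4, 4, 3, 3, 3, 6, 6, 6, 8],
})

def rolling_unique_mean(s, window=4):
    """Rolling mean over the distinct values of s, spread back to every row."""
    unique_mean = s.drop_duplicates().rolling(window).mean()
    # Restore the group's full index and repeat each mean over its duplicates.
    return unique_mean.reindex(s.index).ffill()

# transform calls the helper once per group and keeps the original row order.
df["Output"] = df.groupby("Group")["Column to roll"].transform(rolling_unique_mean)
print(df)
```

Note that drop_duplicates removes every later repeat of a value within a group, not only consecutive ones, so this assumes a value does not reappear after other values have been seen.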

Comparing string entries in two Pandas series

I have two pandas Series and would simply like to compare their string values, returning the strings (and maybe the indices too) of the values they have in common, e.g. Hannah, Frank and Ernie in the example below:
print(x)
print(y)
0 Anne
1 Beth
2 Caroline
3 David
4 Ernie
5 Frank
6 George
7 Hannah
Name: 0, dtype: object
1 Hannah
2 Frank
3 Ernie
4 NaN
5 NaN
6 NaN
7 NaN
Doing
x == y
throws a
ValueError: Can only compare identically-labeled Series objects
as does
x.sort_index(axis=0) == y.sort_index(axis=0)
and
x.reindex_like(y) > y
does something, but not the right thing!

If you need only the common values, you can convert the first Series to a set and use intersection:
a = set(x).intersection(y)
print (a)
{'Hannah', 'Frank', 'Ernie'}
For the indices, use merge (an inner join by default) after reset_index, which converts the indices to columns:
df = pd.merge(x.rename('a').reset_index(), y.rename('a').reset_index(), on='a')
print (df)
index_x a index_y
0 4 Ernie 3
1 5 Frank 2
2 7 Hannah 1
Detail:
print (x.rename('a').reset_index())
index a
0 0 Anne
1 1 Beth
2 2 Caroline
3 3 David
4 4 Ernie
5 5 Frank
6 6 George
7 7 Hannah
print (y.rename('a').reset_index())
index a
0 1 Hannah
1 2 Frank
2 3 Ernie
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
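
If only the matching entries of x together with their positions in x are needed, a boolean mask built with isin is another option. This is a sketch with the two Series rebuilt from the question, not part of the original answer:

```python
import numpy as np
import pandas as pd

x = pd.Series(["Anne", "Beth", "Caroline", "David",
               "Ernie", "Frank", "George", "Hannah"])
y = pd.Series(["Hannah", "Frank", "Ernie", np.nan, np.nan, np.nan, np.nan],
              index=range(1, 8))

# isin compares values only, so the differing indices do not matter.
common = x[x.isin(y.dropna())]
print(common)
```

Unlike the merge, this keeps only the index of x; swap the roles of x and y to get the positions in y.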
