How to compress or stack a pandas dataframe along the rows?

I have a large pandas dataframe with several columns; however, let's focus on two of them:
df = pd.DataFrame([['hey how are you', 'fine thanks', 1],
                   ['good to know', 'yes, and you', 2],
                   ['I am fine', 'ok', 3],
                   ['see you', 'bye!', 4]], columns=list('ABC'))
df
Out:
                 A             B  C
0  hey how are you   fine thanks  1
1     good to know  yes, and you  2
2        I am fine            ok  3
3          see you          bye!  4
Starting from this dataframe, how can I compress two specific columns into a single column while carrying over the values of the other columns? For example:
                 A  C
0  hey how are you  1
1      fine thanks  1
2     good to know  2
3     yes, and you  2
4        I am fine  3
5               ok  3
6          see you  4
7             bye!  4
I tried:
df = df['A'].stack()
df = df.groupby(level=0)
df
However, it doesn't work. Any idea how to achieve the new format?

This will drop the column names, but gets the job done:
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks'],
                   ['good to know', 'yes, and you'],
                   ['I am fine', 'ok'],
                   ['see you', 'bye!']], columns=list('AB'))
df.stack().reset_index(drop=True)
0    hey how are you
1        fine thanks
2       good to know
3       yes, and you
4          I am fine
5                 ok
6            see you
7               bye!
dtype: object
The default stack behaviour keeps the column names:
df.stack()
0  A    hey how are you
   B        fine thanks
1  A       good to know
   B       yes, and you
2  A          I am fine
   B                 ok
3  A            see you
   B               bye!
dtype: object
If you have more columns, you can select the ones to stack using column indexing:
df[["A", "B"]].stack()
With additional columns things get trickier: you need to align the indices by dropping the level that contains the column names:
df["C"] = range(4)
stacked = df[["A", "B"]].stack()
stacked.index = stacked.index.droplevel(level=1)
stacked
0    hey how are you
0        fine thanks
1       good to know
1       yes, and you
2          I am fine
2                 ok
3            see you
3               bye!
dtype: object
Now we can concatenate with the C column:
pd.concat([stacked, df["C"]], axis=1)
                 0  C
0  hey how are you  0
0      fine thanks  0
1     good to know  1
1     yes, and you  1
2        I am fine  2
2               ok  2
3          see you  3
3             bye!  3
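If you'd rather avoid the index manipulation, DataFrame.melt can do the same job. A minimal sketch, not from the original answer, assuming C uniquely identifies each original row so a stable sort restores the interleaved order:
import pandas as pd

df = pd.DataFrame([['hey how are you', 'fine thanks', 1],
                   ['good to know', 'yes, and you', 2],
                   ['I am fine', 'ok', 3],
                   ['see you', 'bye!', 4]], columns=list('ABC'))

# melt stacks A and B into a single 'value' column and carries C along;
# mergesort is stable, so sorting on C interleaves the A/B pairs row by row
out = (df.melt(id_vars='C', value_vars=['A', 'B'])
         .sort_values('C', kind='mergesort')
         .reset_index(drop=True)
         .rename(columns={'value': 'A'})[['A', 'C']])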

You can flatten() (or reshape(-1)) the values of the DataFrame, which are stored as a numpy array:
pd.DataFrame(df.values.flatten(), columns=['A'])
                 A
0  hey how are you
1      fine thanks
2     good to know
3     yes, and you
4        I am fine
5               ok
6          see you
7             bye!
Comments: The default behaviour of np.ndarray.flatten and np.ndarray.reshape is what you want, which is to vary the column index faster than the row index in the original array. This is so-called row-major (C-style) order. To vary the row index faster than the column index, pass in order='F' for column-major, Fortran-style ordering. Docs: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.flatten.html
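A small illustration of the ordering difference, on toy data (not from the original answer):
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df.values.flatten()            # array([1, 2, 3, 4]) -- C order: row by row
df.values.flatten(order='F')   # array([1, 3, 2, 4]) -- Fortran order: column by column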

What you may be looking for is pandas.concat.
It accepts "a sequence or mapping of Series, DataFrame, or Panel objects", so you can pass a list of your DataFrame objects selecting the columns (which will be pd.Series when indexed for a single column).
df3 = pd.concat([df['A'], df['B']])
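Note that this appends all of B after all of A rather than interleaving the rows. A sketch to restore the row-by-row order (mergesort is stable, so within each original index value A stays before B):
df3 = (pd.concat([df['A'], df['B']])
         .sort_index(kind='mergesort')
         .reset_index(drop=True))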

Related

How Can I drop a column if the last row is nan

I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping the column if the last row is NaN. The reason is that I am using time series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I do not want data whose most recent value is NaN, as that means the series is defunct.
  A    B  C
nan    t  x
  1    2  3
  x    y  z
  4  nan  6
Returns
  A  C
nan  x
  1  3
  x  z
  4  6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
     A  C
0  NaN  x
1    1  3
2    x  z
3    4  6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
     A  C
0  NaN  x
1    1  3
2    x  z
3    4  6
You can use .iloc, .loc and .notna() to sort out your problem.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
df = df.loc[:, df.iloc[-1, :].notna()]
You can use a boolean Series to select the columns to drop:
df.drop(df.columns[df.iloc[-1].isna()], axis=1)
Out:
     A  C
0  NaN  x
1    1  3
2    x  z
3    4  6
for col in temp_df.columns:
    if pd.isna(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
This will work for you.
Basically what I'm doing here is looping over all columns and checking whether the last entry is NaN, then dropping that column.
temp_df.columns
is the sequence of column labels.
temp_df.drop(col, axis=1)
col is the column label and axis=1 indicates that you want to drop the column.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that you can compare against anything you wish.
Another method I found is pd.isnull(), a function in the pandas library, which works like this:
for col in temp_df.columns:
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)

How to concatenate partially sequential occurring rows in data frame using pandas

I have a csv in which a conversation is broken into multiple rows, like this:
Names,text,conv_id
tim,hi,1234
jon,hello,1234
jon,how,1234
jon,are you,1234
tim,hey,1234
tim,i am good,1234
pam, me too,1234
jon,great,1234
jon,hows life,1234
So I want to concatenate the sequentially occurring rows into one row to make it more meaningful.
Expected output:
Names,text,conv_id
tim,hi,1234
jon,hello how are you,1234
tim,hey i am good,1234
pam, me too,1234
jon,great hows life,1234
I tried a couple of things but failed. Can anyone please guide me on how to do this?
Thanks in advance.
You can use Series.shift + Series.cumsum to create the appropriate groups, then join the text within each group using groupby.apply. 'conv_id' and 'Names' are added to the grouping keys so that they can be retrieved with Series.reset_index. Finally, DataFrame.reindex places the columns in the initial order:
groups = df['Names'].rename('groups').ne(df['Names'].shift()).cumsum()
new_df = (df.groupby([groups, 'conv_id', 'Names'])['text']
            .apply(lambda x: ' '.join(x))
            .reset_index(level=['Names', 'conv_id'])
            .reindex(columns=df.columns))
print(new_df)
   Names               text  conv_id
1    tim                 hi     1234
2    jon  hello how are you     1234
3    tim      hey i am good     1234
4    pam             me too     1234
5    jon    great hows life     1234
Detail:
print(groups)
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 5
dtype: int64

How to delete all columns in DataFrame except certain ones?

Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), axis=1, inplace=True)

In [49]: df
Out[49]:
   a  b
0  1  2
1  4  3
2  8  9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
   a  b
0  1  2
1  4  3
2  8  9
PS please be aware that the most idiomatic Pandas way to do that was already proposed by @Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Here the first positional argument maps to the items=[] parameter.
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
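For instance, a quick sketch of the regex variant on hypothetical column names:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['a_1', 'a_2', 'b_1', 'b_2'])
df = df.filter(regex=r'^b_')   # keeps b_1 and b_2, drops a_1 and a_2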
There are multiple solutions:
df = df[['a','b']]  #1
df = df[list('ab')]  #2
df = df.loc[:, df.columns.isin(['a','b'])]  #3
df = pd.DataFrame(data=df.eval('a,b').T, columns=['a','b'])  #4 (I do not recommend this method, but it is still a way to achieve this)
Hey, what you are looking for is:
df = df[["a","b"]]
You will receive a dataframe which only contains the columns a and b.
If you want to keep more columns than you're dropping, put a "~" before the .isin statement to select every column except the ones listed:
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to drop, say 20 or 30, you can use a list as well. Make sure that you also specify the axis value.
drop_list = ["a","b"]
df = df.drop(df.columns.difference(drop_list), axis=1)

generating a column based on the values in another column in pandas (python)

The following is a subset of a data frame:
drug_id     WD
lexapro.1   flu-like symptoms
lexapro.1   dizziness
lexapro.1   headache
lexapro.14  Dizziness
lexapro.14  headaches
lexapro.23  extremely difficult
lexapro.32  cry at anything
lexapro.32  Anxiety
I need to generate a column id based on the values in drug_id as follows:
id  drug_id     WD
1   lexapro.1   flu-like symptoms
1   lexapro.1   dizziness
1   lexapro.1   headache
2   lexapro.14  Dizziness
2   lexapro.14  headaches
3   lexapro.23  extremely difficult
4   lexapro.32  cry at anything
4   lexapro.32  Anxiety
I think I need to group them based on drug_id and then generate id based on the size of each group, but I do not know how to do it.
The shift+cumsum pattern mentioned by Boud is good, just make sure to sort by drug_id first. So something like,
df = df.sort_values('drug_id')
df['id'] = (df['drug_id'] != df['drug_id'].shift()).cumsum()
A different approach that does not involve sorting your dataframe would be to map a number to each unique drug_id.
uid = df['drug_id'].unique()
id_map = dict(zip(uid, range(1, len(uid) + 1)))
df['id'] = df['drug_id'].map(id_map)
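A related shortcut, not mentioned in the original answers, is pd.factorize, which assigns an integer code to each unique value in order of first appearance:
# codes start at 0, so add 1 to match the desired ids
df['id'] = pd.factorize(df['drug_id'])[0] + 1
Like the map approach, this does not require sorting the dataframe first.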
Use the shift+cumsum pattern:
(df.drug_id!=df.drug_id.shift()).cumsum()
Out[5]:
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
Name: drug_id, dtype: int32

How to df.groupby(cols).apply(my_func) for some columns, while leave a few columns not tackled?

Suppose I have a Pandas dataframe df with columns a, b, c, d ... z, and I want to do df.groupby('a').apply(my_func) for columns d-z, while leaving columns 'b' and 'c' unchanged. How can I do that?
I notice Pandas can apply different functions to different columns by passing a dict, but I have a long column list and just want a parameter or tip that tells Pandas to bypass some columns and apply my_func() to the rest (otherwise I would have to build a long dict).
One simple (and general) approach is to create a view of the dataframe with the subset you are interested in (or, stated for your case, a view with all columns except the ones you want to ignore), and then use apply on that view.
In [116]: df
Out[116]:
     a  b         c   d        f
0  one  3  0.493808  40      bob
1  two  8  0.150585  50    alice
2  one  6  0.641816  56  michael
3  two  5  0.935653  56      joe
4  one  1  0.521159  48     kate
Use your favorite method to create the view you need. You could select a range of columns like so: df_view = df.loc[:, 'b':'d'], but the following might be more useful for your scenario:
# I want all columns except two
cols = df.columns.tolist()
mycols = [x for x in cols if x not in ['a', 'f']]
df_view = df[mycols]
Apply your function to that view. (Note this doesn't yet change anything in df.)
In [158]: df_view.apply(lambda x: x / 2)
Out[158]:
     b         c     d
0  1.5  0.246904  20.0
1  4.0  0.075293  25.0
2  3.0  0.320908  28.0
3  2.5  0.467827  28.0
4  0.5  0.260579  24.0
Update the df using update():
In [156]: df.update(df_view.apply(lambda x: x / 2))

In [157]: df
Out[157]:
     a    b         c     d        f
0  one  1.5  0.246904  20.0      bob
1  two  4.0  0.075293  25.0    alice
2  one  3.0  0.320908  28.0  michael
3  two  2.5  0.467827  28.0      joe
4  one  0.5  0.260579  24.0     kate
