How to "partially transpose" dataframe in Pandas? - python

I have csv file like this:
A,B,C,X
a,a,a,1.0
a,a,a,2.1
a,b,b,1.2
a,b,b,2.4
a,b,b,3.6
b,c,c,1.1
b,c,d,1.0
(A, B, C) is a "primary key" in this dataset, meaning this set of columns should be unique. What I need to do is find the duplicates and present the associated values (the X column) in separate columns, like this:
A,B,C,X1,X2,X3
a,a,a,1.0,2.1,
a,b,b,1.2,2.4,3.6
I figured out how to find the duplicates and aggregate the X values into tuples:
df = data.groupby(['A', 'B', 'C']).filter(lambda group: len(group) > 1).groupby(['A', 'B', 'C']).aggregate(tuple)
This is basically what I need, but I struggle with transforming it further.
I don't know how many duplicates a given key may have in my data, so I need to find the max and compute the columns:
df['items'] = df['X'].apply(lambda x: len(x))
columns = [f'x_{i}' for i in range(1, df['items'].max() + 1)]
and then create new dataframe with new columns:
df2 = pd.DataFrame(df['X'].tolist(), columns=columns)
But at this point I've lost the index :shrug:
This page in the pandas docs suggests I should use something like this:
df.pivot(columns=columns, values=['X'])
because df already contains an index, but I get this (confusing) error:
KeyError: "None of [Index(['x_1', 'x_2'], dtype='object')] are in the [columns]"
What am I missing here?

I originally marked this as a duplicate of the infamous pivot question, but since this is a bit different, here's an answer:
(df.assign(col=df.groupby(['A', 'B', 'C']).cumcount().add(1))  # number each duplicate 1, 2, 3, ...
 .pivot_table(index=['A', 'B', 'C'], columns='col', values='X')  # spread X across those numbers
 .add_prefix('X')
 .reset_index()
)
Output:
col A B C X1 X2 X3
0 a a a 1.0 2.1 NaN
1 a b b 1.2 2.4 3.6
2 b c c 1.1 NaN NaN
3 b c d 1.0 NaN NaN
Note: this only differs from the linked question/answer in that you groupby/pivot on a set of columns instead of one column.
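For completeness, here is a minimal, self-contained sketch of that approach, building the sample data inline instead of reading the CSV:

import pandas as pd

data = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'B': ['a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'C': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
    'X': [1.0, 2.1, 1.2, 2.4, 3.6, 1.1, 1.0],
})

out = (data.assign(col=data.groupby(['A', 'B', 'C']).cumcount().add(1))
           .pivot_table(index=['A', 'B', 'C'], columns='col', values='X')
           .add_prefix('X')
           .reset_index())
print(out)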

Related

Pandas select data between both index and value range

I wish to set the values of a dataframe that lie between an index range and a value range to NaN. For example, say I have n columns; I want every numeric data point in these columns to be set to NaN if it meets both of the following conditions:
The value is between -1 and 1
The index of the value is between 1 and 3
Below is some code that attempts what I described above, and it almost works, except that it sets the values on a copy of the original dataframe, and trying to use .loc throws the following error:
KeyError: "None of [Index([('a',), ('b',), ('c',)], dtype='object')]
are in the [columns]"
import numpy as np
import pandas as pd
np.random.seed(398)
df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
row_indexer = (df.index > 0) & (df.index < 4)
col_indexer = (df > -1) & (df < 1)
df[row_indexer][col_indexer] = np.nan
I'm sure there's a really simple solution, I just can't figure out the correct syntax.
(Additionally, I want to "extract" these filtered values (the ones I'm setting to NaN) into a second dataframe, but I'm fairly sure any solution that solves the primary question will solve this additional issue)
Any help would be appreciated
Try broadcasting with numpy:
df[row_indexer[:, None] & col_indexer] = np.nan  # (5, 1) row mask broadcasts against the (5, 3) value mask
Output:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914
I would use mul, since True * True == True:
out = df.mask(col_indexer.mul(row_indexer, axis=0))
Out[81]:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914
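As for the side question of extracting the filtered values into a second dataframe, a small sketch building on the broadcast mask above (same df, row_indexer and col_indexer as in the question):

mask = row_indexer[:, None] & col_indexer
extracted = df.where(mask)  # keeps only the values about to be blanked, NaN elsewhere
df[mask] = np.nan           # then blank them in the original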

add values to multiple columns in one go with new index - pandas

df = pd.DataFrame(columns=['w','x','y','z'])
I'm trying to insert rows with new index labels one by one, adding values to certain columns.
If I were adding one value to a specific column, I could do: df.loc['a','x'] = 2
However what if I'd like to add values to several columns in one go, like this:
{'x':2, 'z':3}
is there a way to do this in pandas?
reindex and assign
df.reindex(['a']).assign(**d)
w x y z
a NaN 2 NaN 3
Where:
d = {'x':2, 'z':3}
df = pd.DataFrame(d, index=['a']).combine_first(df)
w x y z
a NaN 2 NaN 3
Use loc, selecting multiple columns, and assign an iterable (like a list or tuple):
df.loc['a',['x','z']] = [2,3]
Or, as suggested by @jfaccioni, in case the data is in a dictionary d:
df.loc['a', list(d.keys())] = list(d.values())
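Putting it together as a runnable sketch, using the df and d from the question:

import pandas as pd

df = pd.DataFrame(columns=['w', 'x', 'y', 'z'])
d = {'x': 2, 'z': 3}
df.loc['a', list(d.keys())] = list(d.values())  # creates row 'a'; w and y stay NaN
print(df)
#      w  x    y  z
# a  NaN  2  NaN  3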

Pandas: Find rows where a particular column is not NA but all other columns are

I have a DataFrame which contains a lot of NA values. I want to write a query which returns rows where a particular column is not NA but all other columns are NA.
I can easily get the rows where the particular column is not NA:
df[df.interesting_column.notna()]
However, I can't figure out how to then say "from that DataFrame return only rows where every column that is not 'interesting_column' is NA". I can't use .dropna, as all rows and columns will contain at least one NA value.
I realise this is probably embarrassingly simple. I have tried lots of .loc variations, join/merges in various configurations and I am not getting anywhere.
Any pointers before I just do a for loop over this thing would be appreciated.
You can simply use a conjunction of the conditions:
df[df.interesting_column.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)]
df.interesting_column.notna() checks that the column is non-null.
df.isnull().sum(axis=1) == len(df.columns) - 1 checks that the number of nulls in the row is the number of columns minus 1.
Both conditions together mean that the entry in the column is the only one that is non-null.
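A quick sketch of this condition on a toy frame (here column 'c' plays the role of interesting_column):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c': [4, 5, np.nan]})
mask = df.c.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)
print(df[mask])
#     a   b    c
# 1 NaN NaN  5.0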
The & operator lets you "and" together two boolean columns row by row. Right now, you are using df.interesting_column.notna() to give you a column of True/False values. You could repeat this for all columns, using notna() or isna() as desired, and use the & operator to combine the results.
For example, if you have columns a, b, and c, and you want to find rows where the value in columns a is not NaN and the values in the other columns are NaN, then do the following:
df[df.a.notna() & df.b.isna() & df.c.isna()]
This is clear and simple when you have a small number of columns that you know about ahead of time. But if you have many columns, or if you don't know the column names, you would want a solution that loops over all columns and checks notna() for the interesting_column and isna() for the others. The solution by @AmiTavory is a clever way to achieve this. But if you didn't know about that solution, here is a simpler approach:
for colName in df.columns:
    if colName == "interesting_column":
        df = df[df[colName].notna()]
    else:
        df = df[df[colName].isna()]
You can use:
rows = df.drop('interesting_column', axis=1).isna().all(1) & df['interesting_column'].notna()
Example (suppose c is the interesting column):
In [99]: df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c':[4, 5, np.nan]})
In [100]: df
Out[100]:
a b c
0 1.0 1.0 4.0
1 NaN NaN 5.0
2 2.0 3.0 NaN
In [101]: rows = df.drop('c', axis=1).isna().all(1) & df.c.notna()
In [102]: rows
Out[102]:
0 False
1 True
2 False
dtype: bool
In [103]: df[rows]
Out[103]:
a b c
1 NaN NaN 5.0

Pandas: create a dataframe relating a column to other two columns

I have a dataframe with three columns: A, B, C. Let's say A and B are integer series ranging from 0 to 10. I'd like to create a new dataframe in which the unique values of A form the index, the unique values of B are the columns, and each cell holds the mean value of C obtained at the intersection A=i, B=j.
So for instance if we grouped the dataframe like this:
Cvalues = df.groupby(['A','B'],as_index=False).mean()
in the (i, j) position of the dataframe I'd like to create, there would be:
Cvalues.loc[Cvalues.A==i].loc[Cvalues.B==j].C
What is the easiest way to do that?
You are almost there. You can either pivot your Cvalues, or better yet, directly go for pivot_table and use its built-in aggfunc option.
df = pd.DataFrame({'A': [2, 0, 1, 1, 2, 0, 1, 0],
                   'B': [1, 2, 1, 0, 1, 2, 1, 1],
                   'C': [10, 20, 30, 40, 50, 60, 70, 80]})
Recommended One-Liner:
res = df.pivot_table(index='A', columns='B', values='C', aggfunc='mean')
Making Your Method Work:
Cvalues = df.groupby(['A','B'],as_index=False).mean()
res = Cvalues.pivot(index='A', columns='B', values='C')
Not that you need it, but just in case, you can make this a little more compact:
res = df.groupby(['A','B'],as_index=False).mean().pivot(index='A', columns='B', values='C')
Here is the result of both ways:
B 0 1 2
A
0 NaN 80.0 40.0
1 40.0 50.0 NaN
2 NaN 30.0 NaN
where, at the intersection of A=2 and B=1: 30.0 = (10 + 50)/2
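For reference, an equivalent groupby/unstack route (a sketch; it gives the same result as the pivot_table call above):

res = df.groupby(['A', 'B'])['C'].mean().unstack('B')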

How to add an empty column to a dataframe?

What's the easiest way to add an empty column to a pandas DataFrame object? The best I've stumbled upon is something like
df['foo'] = df.apply(lambda _: '', axis=1)
Is there a less perverse method?
If I understand correctly, assignment should fill:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
To add to DSM's answer and building on this associated question, I'd split the approach into two cases:
Adding a single column: Just assign empty values to the new column, e.g. df['C'] = np.nan
Adding multiple columns: I'd suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe's column index. This also works for adding multiple new rows with .reindex(index=[...]). Note that newer versions of pandas (v>0.20) allow you to specify an axis keyword rather than explicitly assigning to columns or rows.
Here is an example adding multiple columns:
mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2'])
or
mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1) # version > 0.20.0
You can also always concatenate a new (empty) dataframe to the existing dataframe, but that doesn't feel as pythonic to me :)
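For completeness, a sketch of that concat route (the empty frame contributes only its column labels, so C and D come out as all-NaN columns):

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
df = pd.concat([df, pd.DataFrame(columns=['C', 'D'])], axis=1)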
I like:
df['new'] = pd.Series(dtype='int')
# or use other dtypes like 'float', 'object', ...
If you have an empty dataframe, this solution makes sure that no new row containing only NaN is added.
Specifying dtype is not strictly necessary, however newer Pandas versions produce a DeprecationWarning if not specified.
An even simpler solution is:
df = df.reindex(columns = header_list)
where "header_list" is a list of the headers you want to appear.
any header included in the list that is not found already in the dataframe will be added with blank cells below.
so if
header_list = ['a','b','c', 'd']
then c and d will be added as columns with blank cells
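For example, a small sketch assuming the dataframe starts with columns a and b:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
header_list = ['a', 'b', 'c', 'd']
df = df.reindex(columns=header_list)  # c and d are added, filled with NaN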
Starting with v0.16.0, DF.assign() could be used to assign new columns (single/multiple) to a DF. These columns get inserted in alphabetical order at the end of the DF.
This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe.
Consider the same DF sample demonstrated by @DSM:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
A B
0 1 2
1 2 3
2 3 4
df.assign(C="",D=np.nan)
Out[21]:
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
Note that this returns a copy with all the previous columns along with the newly created ones. For the original DF to be modified accordingly, use it like df = df.assign(...), as it does not currently support in-place operation.
If you want to add column names from a list:
df = pd.DataFrame()
a = ['col1', 'col2', 'col3', 'col4']
for i in a:
    df[i] = np.nan
df["C"] = ""
df["D"] = np.nan
Assignment like this can give you a SettingWithCopyWarning when df is a slice of another DataFrame:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
so it's better to use insert:
df.insert(loc, column, value)
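A minimal sketch of insert (note it modifies the dataframe in place, at the given position):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
df.insert(1, "C", np.nan)  # df columns are now A, C, B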
@emunsing's answer is really cool for adding multiple columns, but I couldn't get it to work for me in Python 2.7. Instead, I found this works:
mydf = mydf.reindex(columns=np.append(mydf.columns.values, ['newcol1', 'newcol2']))
One can use df.insert(index_to_insert_at, column_header, init_value) to insert new column at a specific index.
cost_tbl.insert(1, "col_name", "")
The above statement would insert an empty column after the first column.
This will also work for multiple columns:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
df1 = pd.DataFrame(columns=['C','D','E'])
df = df.join(df1, how="outer")
>>> df
A B C D E
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
Then do whatever you want to do with the columns:
pd.Series.fillna(), pd.Series.map(), etc.
The code below addresses the question "How do I add n empty columns to my existing dataframe?". In the interest of keeping solutions to similar problems in one place, I am adding it here.
Approach 1 (to create 64 additional columns with column names from 1-64)
m = list(range(1, 65))
dd = pd.DataFrame(columns=m)
df.join(dd).replace(np.nan, '')  # df is the dataframe that already exists
Approach 2 (to create 64 additional columns with column names from 1-64)
df.reindex(df.columns.tolist() + list(range(1, 65)), axis=1).replace(np.nan, '')
You can do
df['column'] = None #This works. This will create a new column with None type
df.column = None #This will work only when the column is already present in the dataframe
If you have a list of columns that you want to be empty, you can use assign with a dict comprehension and dict unpacking.
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> nan_cols_name = ["C","D","whatever"]
>>> df.assign(**{col:np.nan for col in nan_cols_name})
A B C D whatever
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
You can also merge several dicts and unpack the combined dict if you want different values for different columns.
df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
nan_cols_name = ["C", "D", "whatever"]
empty_string_cols_name = ["E", "F", "bad column with space"]
df.assign(**{
    **{col: np.nan for col in nan_cols_name},
    **{col: "" for col in empty_string_cols_name},
})
Sorry, I did not explain my answer very well at the beginning. There is another way to add a new column to an existing dataframe.
1st step: make a new empty dataframe (with all the columns in your dataframe, plus the new column or columns you want to add) called df_temp.
2nd step: combine df_temp and your dataframe.
df_temp = pd.DataFrame(columns=(df_null.columns.tolist() + ['empty']))
df = pd.concat([df_temp, df])
It might not be the best solution, but it is another way to think about this question.
The reason I am using this method is that I keep getting this warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["empty1"], df["empty2"] = [np.nan, ""]
I found a way to disable the warning:
pd.options.mode.chained_assignment = None
The reason I was looking for such a solution was simply to add spacer columns between multiple DFs that had been joined column-wise using pd.concat and then written to Excel using xlsxwriter.
df[' '] = df.apply(lambda _: '', axis=1)
df_2 = pd.concat([df, df1], axis=1)  # worked, but only once
# Note: df & df1 have the same rows, which is my index.
df_2[' '] = df_2.apply(lambda _: '', axis=1)  # didn't work this time!?
df_4 = pd.concat([df_2, df_3], axis=1)
I then replaced the second lambda call with
df_2[''] = ''  # which appears to add a blank column
df_4 = pd.concat([df_2,df_3],axis=1)
The output was tested by writing to Excel with xlsxwriter.
In Jupyter the blank columns look the same as in Excel, although without the xlsx formatting.
I'm not sure why the second lambda call didn't work.
