How can I properly use pivot on this pandas dataframe? - python

I have the following df:
Item  Service  Damage  Type  Price
A     Fast     3.5     1     15.48403728
A     Slow     3.5     1     17.41954194
B     Fast     5       1     19.3550466
B     Slow     5       1     21.29055126
C     Fast     5.5     1     23.22605592
and so on
I want to turn this into this format:
Item  Damage  Type  Price_Fast  Price_Slow
So the first row would be:
Item  Damage  Type  Price_Fast  Price_Slow
A     3.5     1     15.4840..   17.41954...
I tried:
df.pivot(index=['Item', 'Damage', 'Type'],columns='Service', values='Price')
but it threw this error:
ValueError: Length of passed values is 2340, index implies 3

To get exactly the dataframe layout you want, use
dfData = dfRaw.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
as @CJR suggested, followed by
dfData.reset_index(inplace=True)
to flatten the dataframe, and
dfData.rename(columns={'Fast': 'Price_Fast', 'Slow': 'Price_Slow'}, inplace=True)
to get your desired column names.
Then use
dfData.columns = dfData.columns.values
to get rid of the custom column index label, and you are done. (Thanks to @Akaisteph7 for pointing out that I was not quite done with my previous solution.)

You can do it with the following code:
# You should use pivot_table as it handles multiple column pivoting and duplicates aggregation
df2 = df.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
# Make the pivot indexes back into columns
df2.reset_index(inplace=True)
# Change the columns' names
df2.rename(columns=lambda x: "Price_"+x if x in ["Fast", "Slow"] else x, inplace=True)
# Remove the unneeded column Index name
df2.columns = df2.columns.values
print(df2)
Output:
  Item  Damage  Type  Price_Fast  Price_Slow
0    A     3.5     1   15.484037   17.419542
1    B     5.0     1   19.355047   21.290551
2    C     5.5     1   23.226056         NaN
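A note on why the original df.pivot call likely failed: pivot accepted only a single column for index before pandas 1.1, and even on newer versions it refuses duplicate entries, which pivot_table aggregates instead (mean by default). As a sketch, assuming the column names above, the whole reshape can also be written as one chain:
df2 = (df.pivot_table(index=['Item', 'Damage', 'Type'],
                      columns='Service', values='Price')
         .add_prefix('Price_')   # Fast/Slow -> Price_Fast/Price_Slow
         .reset_index())
df2.columns.name = None          # drop the leftover 'Service' axis label
print(df2)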

Related

Adding a new column in pandas dataframe from another dataframe with differing indices

This is my original dataframe.
This is my second dataframe containing one column.
I want to add the column of the second dataframe to the original dataframe at the end. The indices are different for the two dataframes. I did it like this:
df1['RESULT'] = df2['RESULT']
It doesn't return an error, and the column is added, but all the values are NaN. How do I add the column with its values?
Assuming the sizes of your dataframes are the same, you can assign RESULT_df['RESULT'].values to your original dataframe. This way, you don't have to worry about indexing issues.
# pre 0.24
feature_file_df['RESULT'] = RESULT_df['RESULT'].values
# >= 0.24
feature_file_df['RESULT'] = RESULT_df['RESULT'].to_numpy()
Minimal Code Sample
df
          A         B
0 -1.202564  2.786483
1  0.180380  0.259736
2 -0.295206  1.175316
3  1.683482  0.927719
4 -0.199904  1.077655
df2
           C
11 -0.140670
12  1.496007
13  0.263425
14 -0.557958
15 -0.018375
Let's try direct assignment first.
df['C'] = df2['C']
df
          A         B   C
0 -1.202564  2.786483 NaN
1  0.180380  0.259736 NaN
2 -0.295206  1.175316 NaN
3  1.683482  0.927719 NaN
4 -0.199904  1.077655 NaN
Now, assign the array returned by .values (or .to_numpy() for pandas versions >= 0.24). .values returns a numpy array, which does not have an index.
df2['C'].values
array([-0.141, 1.496, 0.263, -0.558, -0.018])
df['C'] = df2['C'].values
df
          A         B         C
0 -1.202564  2.786483 -0.140670
1  0.180380  0.259736  1.496007
2 -0.295206  1.175316  0.263425
3  1.683482  0.927719 -0.557958
4 -0.199904  1.077655 -0.018375
You can also call set_axis() to change the index of a dataframe or series. So if the lengths are the same, you can use set_axis() to coerce the index of one dataframe to match the other's.
df1['A'] = df2['A'].set_axis(df1.index)
If you get SettingWithCopyWarning, then to silence it, you can create a copy by either calling join() or assign().
df1 = df1.join(df2['A'].set_axis(df1.index))
# or
df1 = df1.assign(new_col = df2['A'].set_axis(df1.index))
set_axis() is especially useful if you want to add multiple columns from another dataframe. You can just call join() after calling it on the new dataframe.
df1 = df1.join(df2[['A', 'B', 'C']].set_axis(df1.index))

adding values in new column based on indexes with pandas in python

I'm just getting into pandas and I am trying to add a new column to an existing dataframe.
I have two dataframes where the index of one dataframe links to a column in the other dataframe. Where these values are equal, I need to put the value of another column from the source dataframe into a new column of the destination dataframe.
The code section below illustrates what I mean. The commented part is what I need as an output.
I guess I need the .loc[] function.
Another, minor, question: is it bad practice to have non-unique indexes?
import pandas as pd

d = {'key': ['a', 'b', 'c'],
     'bar': [1, 2, 3]}
d2 = {'key': ['a', 'a', 'b'],
      'other_data': ['10', '20', '30']}
df = pd.DataFrame(d)
df2 = pd.DataFrame(data=d2)
df2 = df2.set_index('key')
print(df2)
##     other_data  new_col
## key
## a           10        1
## a           20        1
## b           30        2
Use rename with a Series built from df as the index mapper:
df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print(df2)
    other_data  new
key
a           10    1
a           20    1
b           30    2
Or map:
df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print(df2)
    other_data  new
key
a           10    1
a           20    1
b           30    2
If you want better performance, it is best to avoid duplicates in the index. Some functions, such as reindex, also fail with a duplicated index.
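A quick sketch of the reindex failure just mentioned (the exact error message varies by pandas version):
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
s.reindex(['a', 'b'])  # ValueError: cannot reindex on an axis with duplicate labels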
You can use join:
df2.join(df.set_index('key'))
     other_data  bar
key
a            10    1
a            20    1
b            30    2
One way to rename the column in the process:
df2.join(df.set_index('key').bar.rename('new'))
     other_data  new
key
a            10    1
a            20    1
b            30    2
Another, minor, question: is it bad practice to have non-unique indexes?
It is not great practice, but depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This engenders the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a cartesian product, which is probably not what you're looking for.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2),
                  index=2 * list('abcde'))
df2 = df.rename(columns={0: 'a', 1: 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
         0        1        a        b
a  0.73737  1.49073  0.73737  1.49073
a  0.73737  1.49073 -0.25562 -2.79859
a -0.25562 -2.79859  0.73737  1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583  1.17583 -0.93583  1.17583
b -0.93583  1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583  1.17583
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post.
When the index is unique, pandas uses a hashtable to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N). When the index is non-unique and unsorted, pandas has to check all the keys in the index, O(N).
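A rough sketch to see this yourself with the timeit module (absolute numbers are machine- and version-dependent):
import numpy as np
import pandas as pd
from timeit import timeit

n = 100_000
unique = pd.Series(np.arange(n), index=np.arange(n))           # unique index
dupes = pd.Series(np.arange(n), index=np.zeros(n, dtype=int))  # one repeated label

print(timeit(lambda: unique.loc[n - 1], number=100))  # hashtable lookup
print(timeit(lambda: dupes.loc[0], number=100))       # gathers all n matching rows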
A word on .loc
Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10, 2),
                  index=2 * list('abcde'))
print(df.loc['a'])
         0        1
a  0.73737  1.49073
a -0.25562 -2.79859
With the help of .loc:
df2['new'] = df.set_index('key').loc[df2.index]
Output:
    other_data  new
key
a           10    1
a           20    1
b           30    2
Using combine_first
In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
     bar other_data
key
a    1.0         10
a    1.0         20
b    2.0         30
Or, using map
In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
    other_data  bar
key
a           10    1
a           20    1
b           30    2

Number of rows changes even after `pandas.merge` with `left` option

I am merging two data frames using pandas.merge. Even after specifying the how='left' option, I found that the number of rows of the merged data frame is larger than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like there is more than one row on the right whose 'name2' matches one of your keys on the left. Using the option how='left' with pandas.DataFrame.merge() only means:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
   A    B
0  a  AAA
1  b  BBA
2  c  CCF
and then another DF that looks like this (notice that there is more than one entry matching your key from the left):
In [360]: df_3
Out[360]:
  key  value
0   a      1
1   a      2
2   b      3
3   a      4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
   A    B  key  value
0  a  AAA    a    1.0
1  a  AAA    a    2.0
2  a  AAA    a    4.0
3  b  BBA    b    3.0
4  c  CCF  NaN    NaN
This happened even though I merged with how='left'. As you can see above, there was simply more than one row to merge, and the resulting pd.DataFrame has in fact more rows than the pd.DataFrame on the left.
I hope this helps!
The problem of row doubling after a merge() (of any type, 'inner' or 'left') is usually caused by duplicates in either of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
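Alternatively, pandas 0.21+ can assert the expected key relationship for you via the validate argument, raising a MergeError instead of silently multiplying rows. A sketch with made-up frames:
import pandas as pd

left = pd.DataFrame({'Candidate_u': ['x', 'y'], 'v': [1, 2]})
right = pd.DataFrame({'name2': ['x', 'x'], 'w': [10, 20]})  # duplicated key

# Raises pandas.errors.MergeError because 'name2' is not unique on the right.
pd.merge(left, right, left_on='Candidate_u', right_on='name2',
         how='left', validate='many_to_one')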
If you do not have any duplication, as indicated in the above answer, you should double-check the names of the unmatched entries. In my case, I discovered that the names were inconsistent between df1 and df2, and I solved the problem by:
df1["col1"] = df2["col2"]

Added column to existing dataframe but entered all numbers as NaN

So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe consists of an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, which has an index consisting of a month, day and hour. I tried using append, merge and concat and none worked, and then tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut properly, but all the values were entered as NaN instead of the numerical values that they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If they have the same row counts and you just want to tack it on the end, the indexes either need to match, or you need to just pass the underlying values. In the example below, columns 3 and 5 are the index-matching and values versions respectively, and column 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
          0         1         2
0  0.670812  0.500688  0.136661
1  0.185841  0.239175  0.542369
2  0.351280  0.451193  0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
          0
0  0.638216
1  0.477159
2  0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
          0         1         2         3   4         5
a  0.670812  0.500688  0.136661  0.638216 NaN  0.638216
b  0.185841  0.239175  0.542369  0.477159 NaN  0.477159
c  0.351280  0.451193  0.436108  0.205981 NaN  0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
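For illustration, a minimal sketch of such a merge (the 'timestamp' key column is invented for the example):
import pandas as pd

readings = pd.DataFrame({'timestamp': [1, 2, 3, 4],
                         'temp': [20.1, 20.4, 20.2, 19.9]})
power = pd.DataFrame({'timestamp': [2, 3], 'Power': [5.0, 5.2]})

# A left join keeps every row of `readings`; unmatched timestamps get NaN Power.
combined = readings.merge(power, on='timestamp', how='left')
print(combined)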

How to add an empty column to a dataframe?

What's the easiest way to add an empty column to a pandas DataFrame object? The best I've stumbled upon is something like
df['foo'] = df.apply(lambda _: '', axis=1)
Is there a less perverse method?
If I understand correctly, assignment should fill:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
   A  B
0  1  2
1  2  3
2  3  4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
   A  B C   D
0  1  2   NaN
1  2  3   NaN
2  3  4   NaN
To add to DSM's answer and building on this associated question, I'd split the approach into two cases:
Adding a single column: Just assign empty values to the new column, e.g. df['C'] = np.nan
Adding multiple columns: I'd suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe's column index. This also works for adding multiple new rows with .reindex(index=[...]). Note that newer versions of pandas (v > 0.20) allow you to specify an axis keyword rather than explicitly assigning to columns or index.
Here is an example adding multiple columns:
mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2'])
or
mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1) # version > 0.20.0
You can also always concatenate a new (empty) dataframe to the existing dataframe, but that doesn't feel as pythonic to me :)
I like:
df['new'] = pd.Series(dtype='int')
# or use other dtypes like 'float', 'object', ...
If you have an empty dataframe, this solution makes sure that no new row containing only NaN is added.
Specifying the dtype is not strictly necessary; however, newer pandas versions produce a DeprecationWarning if it is not specified.
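A quick sketch of that empty-dataframe behaviour:
import pandas as pd

df = pd.DataFrame()
df['new'] = pd.Series(dtype='float')
print(len(df))  # 0 -- the frame stays empty; no all-NaN row is appended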
An even simpler solution is:
df = df.reindex(columns = header_list)
where header_list is a list of the headers you want to appear.
Any header included in the list that is not already found in the dataframe will be added with blank cells below.
So if
header_list = ['a','b','c', 'd']
then c and d will be added as columns with blank cells.
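Spelled out as a runnable sketch:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
header_list = ['a', 'b', 'c', 'd']
df = df.reindex(columns=header_list)  # 'c' and 'd' are added, filled with NaN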
Starting with v0.16.0, DF.assign() can be used to assign new columns (single/multiple) to a DF. These columns get inserted at the end of the DF (in alphabetical order in older versions; since Python 3.6/pandas 0.23, the keyword order is preserved).
This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe.
Consider the same DF sample demonstrated by @DSM:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
   A  B
0  1  2
1  2  3
2  3  4
df.assign(C="",D=np.nan)
Out[21]:
   A  B C   D
0  1  2   NaN
1  2  3   NaN
2  3  4   NaN
Note that this returns a copy with all the previous columns along with the newly created ones. In order for the original DF to be modified accordingly, use it like df = df.assign(...), as it does not currently support in-place operation.
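For example, a sketch of the chaining style this enables (the query step is invented to show a follow-on operation):
import numpy as np
import pandas as pd

result = (pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
            .assign(C="", D=np.nan)   # add the new columns
            .query("A > 1")           # keep chaining on the returned copy
            .reset_index(drop=True))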
If you want to add column names from a list:
df = pd.DataFrame()
a = ['col1', 'col2', 'col3', 'col4']
for i in a:
    df[i] = np.nan
df["C"] = ""
df["D"] = np.nan
Assignment will give you this warning SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
so its better to use insert:
df.insert(index, column-name, column-value)
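A concrete sketch of that signature:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
df.insert(1, "C", np.nan)  # insert a NaN column at position 1, between A and B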
@emunsing's answer is really cool for adding multiple columns, but I couldn't get it to work for me in Python 2.7. Instead, I found this works:
mydf = mydf.reindex(columns = np.append(mydf.columns.values, ['newcol1', 'newcol2']))
One can use df.insert(index_to_insert_at, column_header, init_value) to insert a new column at a specific index.
cost_tbl.insert(1, "col_name", "")
The above statement would insert an empty column after the first column.
This will also work for multiple columns:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
   A  B
0  1  2
1  2  3
2  3  4
df1 = pd.DataFrame(columns=['C','D','E'])
df = df.join(df1, how="outer")
>>> df
   A  B    C    D    E
0  1  2  NaN  NaN  NaN
1  2  3  NaN  NaN  NaN
2  3  4  NaN  NaN  NaN
Then do whatever you want to do with the columns: pd.Series.fillna(), pd.Series.map(), etc.
The below code addresses the question "How do I add n number of empty columns to my existing dataframe?". In the interest of keeping solutions to similar problems in one place, I am adding it here.
Approach 1 (to create 64 additional columns, named 1 through 64):
m = list(range(1, 65))
dd = pd.DataFrame(columns=m)
df.join(dd).replace(np.nan, '')  # df is the dataframe that already exists
Approach 2 (to create 64 additional columns, named 1 through 64):
df.reindex(df.columns.tolist() + list(range(1, 65)), axis=1).replace(np.nan, '')
You can do
df['column'] = None   # This works; it will create a new column with None values
df.column = None      # This will work only when the column is already present in the dataframe
If you have a list of columns that you want to be empty, you can use assign with a dict comprehension and dict unpacking.
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> nan_cols_name = ["C","D","whatever"]
>>> df.assign(**{col:np.nan for col in nan_cols_name})
A B C D whatever
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
You can also unpack multiple dicts inside the dict that you unpack, if you want different values for different columns.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
nan_cols_name = ["C","D","whatever"]
empty_string_cols_name = ["E","F","bad column with space"]
df.assign(**{
    **{col: np.nan for col in nan_cols_name},
    **{col: "" for col in empty_string_cols_name}
})
Sorry, I did not explain my answer very well at the beginning. There is another way to add a new column to an existing dataframe.
1st step: make a new empty dataframe (with all the columns in your dataframe, plus the new column or columns you want to add) called df_temp.
2nd step: combine df_temp and your dataframe.
df_temp = pd.DataFrame(columns=(df_null.columns.tolist() + ['empty']))
df = pd.concat([df_temp, df])
It might not be the best solution, but it is another way to think about this question.
The reason I am using this method is because I get this warning all the time:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["empty1"], df["empty2"] = [np.nan, ""]
Great, I found a way to disable the warning:
pd.options.mode.chained_assignment = None
The reason I was looking for such a solution was simply to add spaces between multiple DFs which had been joined column-wise using the pd.concat function and then written to Excel using xlsxwriter.
df[' '] = df.apply(lambda _: '', axis=1)
df_2 = pd.concat([df, df1], axis=1)  # worked, but only once
# Note: df & df1 have the same rows, which is my index.
df_2[' '] = df_2.apply(lambda _: '', axis=1)  # didn't work this time!
df_4 = pd.concat([df_2, df_3], axis=1)
I then replaced the second lambda call with
df_2[''] = ''  # which appears to add a blank column
df_4 = pd.concat([df_2, df_3], axis=1)
The output I tested was written to Excel using xlsxwriter.
The blank columns look the same in Jupyter as in Excel, although without the xlsx formatting.
Not sure why the second lambda call didn't work.
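A guess at why: df_2 already contained a column named ' ' (inherited from df), so the second assignment overwrote that column instead of appending another spacer; the labels have to differ. A sketch of the spacer idea with uniquely named blank columns (df, df1, df_3 stand in for the frames above and are assumed to share an index):
import pandas as pd

frames = [df, df1, df_3]
pieces = []
for i, f in enumerate(frames):
    pieces.append(f)
    if i < len(frames) - 1:
        # blank spacer column; the label ' ' * (i + 1) is unique per gap
        pieces.append(pd.DataFrame({' ' * (i + 1): ''}, index=f.index))
out = pd.concat(pieces, axis=1)
out.to_excel('combined.xlsx', engine='xlsxwriter')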
