Python Pandas: I need to transform a pandas dataframe - python

The data frame looks like this. I have tried pivot, stack and unstack. Is there any method to achieve the desired output?
key attribute text_value numeric_value date_value
0 1 order NaN NaN 10/02/19
1 1 size NaN 43.0 NaN
2 1 weight NaN 22.0 NaN
3 1 price NaN 33.0 NaN
4 1 segment product NaN NaN
5 2 order NaN NaN 11/02/19
6 2 size NaN 34.0 NaN
7 2 weight NaN 32.0 NaN
8 2 price NaN 89.0 NaN
9 2 segment customer NaN NaN
I need the following output
key order size weight price segment
1 10/2/2019 43.0 22.0 33.0 product
2 11/2/2019 34.0 32.0 89.0 customer
Thanks in advance

I believe you don't want to change dtypes in the output data, so a possible solution is to process each value column separately with DataFrame.dropna and DataFrame.pivot, and then join the results together with concat (pivot takes keyword arguments here because pandas 2.0 made them keyword-only):
df['date_value'] = pd.to_datetime(df['date_value'])
df1 = df.dropna(subset=['text_value']).pivot(index='key', columns='attribute', values='text_value')
df2 = df.dropna(subset=['numeric_value']).pivot(index='key', columns='attribute', values='numeric_value')
df3 = df.dropna(subset=['date_value']).pivot(index='key', columns='attribute', values='date_value')
df = pd.concat([df1, df2, df3], axis=1).reindex(df['attribute'].unique(), axis=1)
print (df)
attribute order size weight price segment
key
1 2019-10-02 43.0 22.0 33.0 product
2 2019-11-02 34.0 32.0 89.0 customer
print (df.dtypes)
order datetime64[ns]
size float64
weight float64
price float64
segment object
dtype: object
Old answer - all values end up with object dtype:
df['date_value'] = pd.to_datetime(df['date_value'])
df['text_value'] = df['text_value'].fillna(df['numeric_value']).fillna(df['date_value'])
df = df.pivot(index='key', columns='attribute', values='text_value')
print (df)
attribute order price segment size weight
key
1 1569974400000000000 33 product 43 22
2 1572652800000000000 89 customer 34 32
print (df.dtypes)
order object
price object
segment object
size object
weight object
dtype: object
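For reference, the per-column pivot approach can be sketched end to end as follows. The sample frame is rebuilt by hand from the question, the helper name pivot_typed is made up for this illustration, and the explicit date format is an assumption about how the dates should be read.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (two keys, five attributes each).
df = pd.DataFrame({
    "key": [1] * 5 + [2] * 5,
    "attribute": ["order", "size", "weight", "price", "segment"] * 2,
    "text_value": [np.nan] * 4 + ["product"] + [np.nan] * 4 + ["customer"],
    "numeric_value": [np.nan, 43.0, 22.0, 33.0, np.nan,
                      np.nan, 34.0, 32.0, 89.0, np.nan],
    "date_value": ["10/02/19"] + [np.nan] * 4 + ["11/02/19"] + [np.nan] * 4,
})
# Assumption: dates are month/day/two-digit year.
df["date_value"] = pd.to_datetime(df["date_value"], format="%m/%d/%y")


def pivot_typed(frame, value_columns):
    """Pivot each value column on its own so every output column keeps its dtype."""
    parts = [
        frame.dropna(subset=[col]).pivot(index="key", columns="attribute", values=col)
        for col in value_columns
    ]
    wide = pd.concat(parts, axis=1)
    # Restore the order in which the attributes first appear in the input.
    return wide.reindex(columns=frame["attribute"].unique())


result = pivot_typed(df, ["text_value", "numeric_value", "date_value"])
print(result)
print(result.dtypes)
```

Pivoting one value column at a time avoids mixing strings, floats and timestamps in a single column, which is what forces object dtype in the older approach.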

This is the solution I figured out:
attr_dict = {'order':'date_value', 'size':'numeric_value', 'weight':'numeric_value', 'price':'numeric_value', 'segment':'text_value'}
output_table = pd.DataFrame()
for attr in attr_dict.keys():
    # keep only the rows for this attribute, with the key and the column that holds its value
    temp = input_table[input_table['attribute'] == attr][['key', attr_dict[attr]]]
    temp = temp.rename(columns={attr_dict[attr]:attr})
    output_table[attr] = list(temp.values[:,1])
output_table

Related

Reset index without multiple headers after pivot in pandas

I have this DataFrame
df = pd.DataFrame({'store':[1,1,1,2],'upc':[11,22,33,11],'sales':[14,16,11,29]})
which gives this output
store upc sales
0 1 11 14
1 1 22 16
2 1 33 11
3 2 11 29
I want something like this
store upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
I tried this
newdf = df.pivot(index='store', columns='upc')
newdf.columns = newdf.columns.droplevel(0)
and the output looks like this with multiple headers
upc 11 22 33
store
1 14.0 16.0 11.0
2 29.0 NaN NaN
I also tried
newdf = df.pivot(index='store', columns='upc').reset_index()
This also gives multiple headers
store sales
upc 11 22 33
0 1 14.0 16.0 11.0
1 2 29.0 NaN NaN
Try an f-string with the columns attribute and a list comprehension:
newdf = df.pivot(index='store', columns='upc')
newdf.columns=[f"upc_{y}" for x,y in newdf.columns]
newdf=newdf.reset_index()
OR
In 2 steps:
newdf = df.pivot(index='store', columns='upc').reset_index()
newdf.columns=[f"upc_{y}" if y!='' else f"{x}" for x,y in newdf.columns]
Another option, which is longer than @Anurag's:
(df.pivot(index='store', columns='upc')
   .droplevel(axis=1, level=0)
   .rename(columns = lambda col: f"upc_{col}")
   .rename_axis(index=None, columns=None)
)
upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
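A compact variant, offered as a sketch rather than the definitive way: passing values= to pivot keeps the columns single-level from the start, so there is no second header level to drop. The variable name flat is just for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"store": [1, 1, 1, 2],
                   "upc": [11, 22, 33, 11],
                   "sales": [14, 16, 11, 29]})

# Passing values= keeps the columns single-level, so no droplevel is needed.
flat = (
    df.pivot(index="store", columns="upc", values="sales")
      .add_prefix("upc_")            # 11 -> "upc_11", and so on
      .reset_index()                 # turn "store" back into a regular column
      .rename_axis(columns=None)     # drop the leftover "upc" label on the column axis
)
print(flat)
```

Unlike the rename_axis(index=None, ...) version, this keeps store as an ordinary column, which matches the layout asked for in the question.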

Pandas data frame compare and replace values

I have two pandas data frames like the ones below. The column 'No' is a common field. Based on 'No', I want to replace values in the 'Total' column of the first data frame.
The condition is: wherever 'No' matches, take the 'Marks1' value from data frame 2 and put it in 'Total'. If 'Marks1' is null, take the 'Marks2' value instead. If both Marks1 and Marks2 are null, set 'Total' to null.
The final result should be in data frame 1. Both data frames have a few hundred thousand records.
Data frame1
No|Total
1234|11
2515|21
3412|32
4854|
7732|53
Data frame2
No|Marks1|Marks2
1234|99|23
2515|98|31
3412||20
4854||98
7732||
Result :
No|Total
1234|99
2515|98
3412|20
4854|98
7732|
Use Series.map, with the missing values of Marks1 replaced by Marks2 via Series.fillna:
df = df2.set_index('No')
df1['Total'] = df1['No'].map(df['Marks1'].fillna(df['Marks2']))
print (df1)
No Total
0 1234 99.0
1 2515 98.0
2 3412 20.0
3 4854 98.0
4 7732 NaN
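A self-contained sketch of the same idea, with both frames rebuilt by hand from the question. The names lookup and marks are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"No": [1234, 2515, 3412, 4854, 7732],
                    "Total": [11, 21, 32, np.nan, 53]})
df2 = pd.DataFrame({"No": [1234, 2515, 3412, 4854, 7732],
                    "Marks1": [99, 98, np.nan, np.nan, np.nan],
                    "Marks2": [23, 31, 20, 98, np.nan]})

# Build a lookup Series indexed by No: Marks1 where present, otherwise Marks2.
lookup = df2.set_index("No")
marks = lookup["Marks1"].fillna(lookup["Marks2"])

# map() looks each No up in the Series; a No missing from df2 would become NaN.
df1["Total"] = df1["No"].map(marks)
print(df1)
```

Because map aligns on the values of No rather than on row position, this does not depend on the two frames being in the same order.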
If No can contain duplicated values in df2, aggregate them first:
print (df2)
No Marks1 Marks2
0 1234 99.0 23.0 <- duplicated No
1 1234 98.0 31.0 <- duplicated No
2 3412 NaN 20.0
3 4854 NaN 98.0
4 7732 NaN NaN
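# pandas 2.0 and later (the level argument of sum was removed)
df = df2.groupby('No').sum(min_count=1)
# older pandas versions
#df = df2.set_index('No').sum(level=0, min_count=1)
# even older pandas versions without min_count
#df = df2.set_index('No').sum(level=0)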
print (df)
Marks1 Marks2
No
1234 197.0 54.0<- unique No, values are summed per index created by No
3412 NaN 20.0
4854 NaN 98.0
7732 NaN NaN
df1['Total'] = df1['No'].map(df['Marks1'].fillna(df['Marks2']))
print (df1)
No Total
0 1234 197.0
1 2515 NaN
2 3412 20.0
3 4854 98.0
4 7732 NaN
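Summing duplicated marks may not always be what is wanted. As a sketch, the example below contrasts summing with keeping the first non-null value per No; the second option is an assumption about the intended behaviour, not something stated in the question, and the column names Total_sum and Total_first are made up for the comparison.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"No": [1234, 2515, 3412, 4854, 7732]})
df2 = pd.DataFrame({"No": [1234, 1234, 3412, 4854, 7732],
                    "Marks1": [99, 98, np.nan, np.nan, np.nan],
                    "Marks2": [23, 31, 20, 98, np.nan]})

# Option 1: sum duplicates; min_count=1 keeps all-NaN groups as NaN instead of 0.
summed = df2.groupby("No").sum(min_count=1)

# Option 2 (assumption): keep the first non-null value per No instead of summing.
first = df2.groupby("No").first()

df1["Total_sum"] = df1["No"].map(summed["Marks1"].fillna(summed["Marks2"]))
df1["Total_first"] = df1["No"].map(first["Marks1"].fillna(first["Marks2"]))
print(df1)
```

Either way, the index of the lookup becomes unique, which map needs in order to resolve each No to a single value.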
If df1 and df2 have the same index values and every No matches row for row, use:
df1['Total'] = df2['Marks1'].fillna(df2['Marks2'])
You can use np.select here.
m = df2['Marks1'].notna()
m1 = df2['Marks1'].isna() & df2['Marks2'].notna()
condlist = [m,m1]
choice = [df2['Marks1'] , df2['Marks2']]
df1['Total'] = np.select(condlist,choice,np.nan)
No Total
0 1234 99.0
1 2515 98.0
2 3412 20.0
3 4854 98.0
4 7732 NaN

How to change NaNs in dataframe to 0? [duplicate]

I have a Pandas Dataframe as below:
itm Date Amount
67 420 2012-09-30 00:00:00 65211
68 421 2012-09-09 00:00:00 29424
69 421 2012-09-16 00:00:00 29877
70 421 2012-09-23 00:00:00 30990
71 421 2012-09-30 00:00:00 61303
72 485 2012-09-09 00:00:00 71781
73 485 2012-09-16 00:00:00 NaN
74 485 2012-09-23 00:00:00 11072
75 485 2012-09-30 00:00:00 113702
76 489 2012-09-09 00:00:00 64731
77 489 2012-09-16 00:00:00 NaN
When I try to apply a function to the Amount column, I get the following error:
ValueError: cannot convert float NaN to integer
I have tried applying a function using .isnan from the math module.
I have tried the pandas .replace method.
I tried the .sparse data attribute from pandas 0.9.
I have also tried an if NaN == NaN statement in a function.
I have also looked at the article "How do I replace NA values with zeros in an R dataframe?" as well as some other articles.
None of the methods I have tried worked, or they do not recognise NaN.
Any hints or solutions would be appreciated.
I believe DataFrame.fillna() will do this for you.
See the documentation for DataFrame.fillna and for Series.fillna.
Example:
In [7]: df
Out[7]:
0 1
0 NaN NaN
1 -0.494375 0.570994
2 NaN NaN
3 1.876360 -0.229738
4 NaN NaN
In [8]: df.fillna(0)
Out[8]:
0 1
0 0.000000 0.000000
1 -0.494375 0.570994
2 0.000000 0.000000
3 1.876360 -0.229738
4 0.000000 0.000000
To fill the NaNs in only one column, select just that column. In this case I'm using inplace=True to actually change the contents of df.
In [12]: df[1].fillna(0, inplace=True)
Out[12]:
0 0.000000
1 0.570994
2 0.000000
3 -0.229738
4 0.000000
Name: 1
In [13]: df
Out[13]:
0 1
0 NaN 0.000000
1 -0.494375 0.570994
2 NaN 0.000000
3 1.876360 -0.229738
4 NaN 0.000000
EDIT:
To avoid a SettingWithCopyWarning, use the built in column-specific functionality:
df.fillna({1:0}, inplace=True)
Slicing is not guaranteed to return a view rather than a copy, so an in-place fill on a slice may not reach the original. You can assign the result back instead:
df['column'] = df['column'].fillna(value)
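A small sketch of the two column-specific forms side by side, using a made-up two-column frame. Neither relies on modifying a slice in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, -0.5, np.nan], "b": [np.nan, 0.6, np.nan]})

# Fill only column "b" by passing a dict; this returns a new DataFrame.
filled_b = df.fillna({"b": 0})

# Same effect: fill the column and assign it back explicitly.
df_assigned = df.copy()
df_assigned["b"] = df_assigned["b"].fillna(0)

print(filled_b)
```

Both forms leave column a untouched, and the original df is unchanged because nothing was done in place.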
You could use replace to change NaN to 0:
import pandas as pd
import numpy as np
# for column
df['column'] = df['column'].replace(np.nan, 0)
# for whole dataframe
df = df.replace(np.nan, 0)
# inplace
df.replace(np.nan, 0, inplace=True)
The below code worked for me.
import pandas
df = pandas.read_csv('somefile.txt')
df = df.fillna(0)
I just wanted to provide a bit of an update/special case since it looks like people still come here. If you're using a multi-index or otherwise using an index-slicer the inplace=True option may not be enough to update the slice you've chosen. For example in a 2x2 level multi-index this will not change any values (as of pandas 0.15):
idx = pd.IndexSlice
df.loc[idx[:,mask_1],idx[mask_2,:]].fillna(value=0,inplace=True)
The "problem" is that the chaining breaks the fillna ability to update the original dataframe. I put "problem" in quotes because there are good reasons for the design decisions that led to not interpreting through these chains in certain situations. Also, this is a complex example (though I really ran into it), but the same may apply to fewer levels of indexes depending on how you slice.
The solution is DataFrame.update:
df.update(df.loc[idx[:,mask_1],idx[[mask_2],:]].fillna(value=0))
It's one line, reads reasonably well (sort of) and eliminates any unnecessary messing with intermediate variables or loops while allowing you to apply fillna to any multi-level slice you like!
If anybody can find places this doesn't work please post in the comments, I've been messing with it and looking at the source and it seems to solve at least my multi-index slice problems.
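A minimal sketch of the update pattern on a small frame with two-level rows and columns. The labels (r1, x, a, p and so on) are invented for the illustration, and both indexes are built sorted so that IndexSlice selection works.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["r1", "r2"], ["x", "y"]])
cols = pd.MultiIndex.from_product([["a", "b"], ["p", "q"]])
df = pd.DataFrame(np.nan, index=rows, columns=cols)

idx = pd.IndexSlice

# Select rows whose second level is "x" and columns under "a", then fill that copy.
patch = df.loc[idx[:, "x"], idx["a", :]].fillna(0)

# update() aligns the patch on index and columns and writes its non-NaN values into df.
df.update(patch)
print(df)
```

Only the four selected cells change. Everything outside the slice stays NaN, because update ignores positions where the patch has no value.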
You can also pass a dictionary to fill NaN values in specific columns, rather than filling the whole DataFrame with a single value.
import pandas as pd
df = pd.read_excel('example.xlsx')
df.fillna({
    'column1': 'Write your values here',
    'column2': 'Write your values here',
    'column3': 'Write your values here',
    'column4': 'Write your values here',
    # ...
    'column-n': 'Write your values here'}, inplace=True)
An easy way to fill the missing values:
Filling string columns: when string columns have missing (NaN) values, fill them with the most frequent value.
df['string column name'].fillna(df['string column name'].mode().values[0], inplace = True)
Filling numeric columns: when numeric columns have missing (NaN) values, fill them with the mean.
df['numeric column name'].fillna(df['numeric column name'].mean(), inplace = True)
Filling NaN with zero:
df['column name'].fillna(0, inplace = True)
To replace NA values in pandas:
df['column_name'].fillna(value_to_be_replaced,inplace=True)
If inplace=False, the call returns the modified values instead of updating df (the dataframe).
Assuming the Amount column in the table above should be of integer type, the following would be a solution:
df['Amount'] = df.Amount.fillna(0).astype(int)
Similarly, you can cast the filled column to other data types such as float, str and so on.
In particular, consider the datatype when you need to compare values within the same column.
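To illustrate why the cast only works after filling, here is a short sketch on three rows taken from the question's table. The nullable Int64 alternative is an assumption about what may be wanted when a gap should stay visible rather than become 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"itm": [485, 485, 489], "Amount": [71781, np.nan, 64731]})

# A column holding NaN is stored as float64, so int conversion fails until NaN is filled.
as_int = df["Amount"].fillna(0).astype(int)

# Alternative (assumption): keep the gap as <NA> using the nullable integer dtype.
as_nullable = df["Amount"].astype("Int64")

print(as_int.tolist())
print(as_nullable.tolist())
```

The first form treats a missing amount as zero, which changes sums and means. The second keeps it missing while still giving an integer column.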
There have been many contributions already, but since I'm new here, I will still give my input.
There are two approaches to replacing NaN values with zeros in a pandas DataFrame:
fillna(): fills NA/NaN values with the specified value or method.
replace(): a general method for replacing values, which can be given as a string, regex, list or dictionary.
Example:
#NaN with zero on all columns
df2 = df.fillna(0)
# inplace=True modifies df itself instead of returning a new DataFrame
df.fillna(0, inplace = True)
# multiple columns approach
df[["Student", "ID"]] = df[["Student", "ID"]].fillna(0)
Finally, the replace() method:
df["Student"] = df["Student"].replace(np.nan, 0)
Replace all nan with 0
df = df.fillna(0)
To replace nan in different columns with different ways:
replacement= {'column_A': 0, 'column_B': -999, 'column_C': -99999}
df.fillna(value=replacement)
This works for me, but no one's mentioned it. Could there be something wrong with it?
df.loc[df['column_name'].isnull(), 'column_name'] = 0
There are primarily two options for imputing or filling missing values (NaN / np.nan) with numerical replacements across one or more columns.
For simple cases, df['Amount'].fillna(value=..., method=..., axis=...) is sufficient.
From the Documentation:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). (values not
in the dict/Series/DataFrame will not be filled). This value cannot
be a list.
This means the fill value may be a scalar (a number or a string constant), a dict, a Series, or a DataFrame, but not a list.
For more specialized imputations use SimpleImputer():
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(X=df[['Col-1', 'Col-2']])
If you were to convert it to a pandas dataframe, you can also accomplish this by using fillna.
import numpy as np
df=np.array([[1,2,3, np.nan]])
import pandas as pd
df=pd.DataFrame(df)
df.fillna(0)
This will return the following:
0 1 2 3
0 1.0 2.0 3.0 NaN
>>> df.fillna(0)
0 1 2 3
0 1.0 2.0 3.0 0.0
If you want to fill NaN for a specific column you can use loc:
d1 = {"Col1" : ['A', 'B', 'C'],
"fruits": ['Avocado', 'Banana', 'NaN']}
d1= pd.DataFrame(d1)
output:
Col1 fruits
0 A Avocado
1 B Banana
2 C NaN
d1.loc[ d1.Col1=='C', 'fruits' ] = 'Carrot'
output:
Col1 fruits
0 A Avocado
1 B Banana
2 C Carrot
I think it's also worth mentioning and explaining the parameters of fillna(), such as method, axis, limit, etc.
From the documentation we have:
Series.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
Fill NA/NaN values using the specified method.
Parameters
value [scalar, dict, Series, or DataFrame]: Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None]: Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
axis [{0 or ‘index’}]: Axis along which to fill missing values.
inplace [bool, default False]: If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit [int, default None]: If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast [dict, default is None]: A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
OK. Let's start with the method= parameter, which offers forward fill (ffill) and backward fill (bfill).
ffill copies the previous non-missing value forward.
E.g.:
import pandas as pd
import numpy as np
inp = [{'c1':10, 'c2':np.nan, 'c3':200}, {'c1':np.nan,'c2':110, 'c3':210}, {'c1':12,'c2':np.nan, 'c3':220},{'c1':12,'c2':130, 'c3':np.nan},{'c1':12,'c2':np.nan, 'c3':240}]
df = pd.DataFrame(inp)
c1 c2 c3
0 10.0 NaN 200.0
1 NaN 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 NaN
4 12.0 NaN 240.0
Forward fill:
df.fillna(method="ffill")
c1 c2 c3
0 10.0 NaN 200.0
1 10.0 110.0 210.0
2 12.0 110.0 220.0
3 12.0 130.0 220.0
4 12.0 130.0 240.0
Backward fill:
df.fillna(method="bfill")
c1 c2 c3
0 10.0 110.0 200.0
1 12.0 110.0 210.0
2 12.0 130.0 220.0
3 12.0 130.0 240.0
4 12.0 NaN 240.0
The axis parameter lets us choose the direction of the fill:
Fill directions:
ffill:
Axis = 1
Method = 'ffill'
----------->
direction
df.fillna(method="ffill", axis=1)
c1 c2 c3
0 10.0 10.0 200.0
1 NaN 110.0 210.0
2 12.0 12.0 220.0
3 12.0 130.0 130.0
4 12.0 12.0 240.0
Axis = 0 # by default
Method = 'ffill'
|
| # direction
|
V
e.g: # This is the ffill default
df.fillna(method="ffill", axis=0)
c1 c2 c3
0 10.0 NaN 200.0
1 10.0 110.0 210.0
2 12.0 110.0 220.0
3 12.0 130.0 220.0
4 12.0 130.0 240.0
bfill:
axis= 0
method = 'bfill'
^
|
|
|
df.fillna(method="bfill", axis=0)
c1 c2 c3
0 10.0 110.0 200.0
1 12.0 110.0 210.0
2 12.0 130.0 220.0
3 12.0 130.0 240.0
4 12.0 NaN 240.0
axis = 1
method = 'bfill'
<-----------
df.fillna(method="bfill", axis=1)
c1 c2 c3
0 10.0 200.0 200.0
1 110.0 110.0 210.0
2 12.0 220.0 220.0
3 12.0 130.0 NaN
4 12.0 240.0 240.0
# aliases:
# 'ffill' == 'pad'
# 'bfill' == 'backfill'
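Recent pandas releases deprecate fillna(method=...) in favour of the dedicated ffill() and bfill() methods, so the four directions above can also be sketched as follows. The variable names down, up, right and left are only labels for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

down = df.ffill()          # axis=0: copy the last valid value down each column
up = df.bfill()            # axis=0: copy the next valid value up each column
right = df.ffill(axis=1)   # axis=1: copy the last valid value from left to right
left = df.bfill(axis=1)    # axis=1: copy the next valid value from right to left

print(right)
```

A NaN at the start of a forward fill, or at the end of a backward fill, has no neighbour to copy from and therefore stays NaN.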
limit parameter:
df
c1 c2 c3
0 10.0 NaN 200.0
1 NaN 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 NaN
4 12.0 NaN 240.0
Only replace the first NaN element in each column:
df.fillna(value = 'Unavailable', limit=1)
c1 c2 c3
0 10.0 Unavailable 200.0
1 Unavailable 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 Unavailable
4 12.0 NaN 240.0
df.fillna(value = 'Unavailable', limit=2)
c1 c2 c3
0 10.0 Unavailable 200.0
1 Unavailable 110.0 210.0
2 12.0 Unavailable 220.0
3 12.0 130.0 Unavailable
4 12.0 NaN 240.0
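The two meanings of limit can be sketched as follows. A numeric fill value of 0 is used here instead of the string 'Unavailable' so that the columns stay numeric; that substitution, and the gappy series, are choices made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "c1": [10, np.nan, 12, 12, 12],
    "c2": [np.nan, 110, np.nan, 130, np.nan],
    "c3": [200, 210, 220, np.nan, 240],
})

# With a fill value, limit caps how many NaNs are filled in each column.
one_per_column = df.fillna(0, limit=1)

# With a forward fill, limit caps how far each run of consecutive NaNs is filled.
gappy = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
partly_filled = gappy.ffill(limit=2)

print(one_per_column)
print(partly_filled.tolist())
```

In the first case c2 keeps two of its three NaNs. In the second, the third NaN of the run is left in place because the limit of two has been reached.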
downcast parameter:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 4 non-null float64
1 c2 2 non-null float64
2 c3 4 non-null float64
dtypes: float64(3)
memory usage: 248.0 bytes
df.fillna(method="ffill",downcast='infer').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 5 non-null int64
1 c2 4 non-null float64
2 c3 5 non-null int64
dtypes: float64(1), int64(2)
memory usage: 248.0 bytes

How to create a Function in python for detecting missing value?

Name Sex Age Ticket_No Fare
0 Braund male 22 HN07681 2500
1 NaN female 42 HN05681 6895
2 peter male NaN KKSN55 800
3 NaN male 56 HN07681 2500
4 Daisy female 22 hf55s44 NaN
5 Manson NaN 48 HN07681 8564
6 Piston male NaN HN07681 5622
7 Racline female 42 Nh55146 NaN
8 Nan male 22 HN07681 4875
9 NaN NaN NaN NaN NaN
col_Name No_of_Missing Mean Median Mode
0 Name 3 NaN NaN NaN
1 Sex 1 NaN NaN NaN
2 Age 2 36 42 22
3 Fare 2 4536 4875 2500
Mean/Median/Mode is only for numerical datatype, otherwise should be null.
Try this:
import numpy as np
import pandas as pd

# Your original df
print(df)
# First drop any rows which are completely NaN
df = df.dropna(how = "all")
# Create a list to hold other lists.
# This will be used as the data for the new dataframe
new_data = []
# Parse through the columns
for col in df.columns:
    # Create a new list, which will be one row of data in the new dataframe
    # The first item containing only the column's name,
    # to correspond with the new df's first column
    _list = [col]
    _list.append(df.dtypes[col]) # DType for that column is the second item/second column
    missing = df[col].isna().sum() # Total the number of "NaN" in column
    if missing > 30:
        print("Max total number of missing exceeded")
        continue # Skip this column and continue on to next column
    _list.append(missing)
    # Get the mean. Falls back to NaN if it cannot be computed
    try:
        mean = df[col].mean()
    except Exception:
        mean = np.nan
    _list.append(mean) # Append to proper column position
    # Get the median. Falls back to NaN if it cannot be computed
    try:
        median = df[col].median()
    except Exception:
        median = np.nan
    _list.append(median)
    # Get the mode. Note that mode()[1] is the second entry of the sorted modes,
    # which only exists when several values tie; mode().iloc[0] would give
    # the most frequent value instead
    try:
        mode = df[col].mode()[1]
    except Exception:
        mode = np.nan
    _list.append(mode)
    new_data.append(_list)
columns = ["col_Name", "DType", "No_of_Missing", "Mean", "Median", "Mode"]
new_df = pd.DataFrame(new_data, columns = columns)
print("============================")
print(new_df)
OUTPUT:
Name Sex Age Ticket_No Fare
0 Braund male 22.0 HN07681 2500.0
1 NaN female 42.0 HN05681 6895.0
2 peter male NaN KKSN55 800.0
3 NaN male 56.0 HN07681 2500.0
4 Daisy female 22.0 hf55s44 NaN
5 Manson NaN 48.0 HN07681 8564.0
6 Piston male NaN HN07681 5622.0
7 Racline female 42.0 Nh55146 NaN
8 NaN male 22.0 HN07681 4875.0
9 NaN NaN NaN NaN NaN
============================
col_Name DType No_of_Missing Mean Median Mode
0 Name object 3 NaN NaN Daisy
1 Sex object 1 NaN NaN NaN
2 Age float64 2 36.285714 42.0 NaN
3 Ticket_No object 0 NaN NaN NaN
4 Fare float64 2 4536.571429 4875.0 NaN
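A variant sketch that restricts the statistics to numeric columns and takes the first (most frequent) mode, which is closer to the table the question asks for. The function name missing_summary is made up, and the sample frame reproduces only the Name, Age and Fare columns from the question, without the all-NaN row.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def missing_summary(frame):
    """Count missing values per column; report mean/median/mode for numeric columns only."""
    rows = []
    for col in frame.columns:
        series = frame[col]
        row = {"col_Name": col, "No_of_Missing": int(series.isna().sum()),
               "Mean": np.nan, "Median": np.nan, "Mode": np.nan}
        if is_numeric_dtype(series):
            modes = series.mode()  # sorted; empty if every value is NaN
            row["Mean"] = series.mean()
            row["Median"] = series.median()
            row["Mode"] = modes.iloc[0] if not modes.empty else np.nan
        rows.append(row)
    return pd.DataFrame(rows)


df = pd.DataFrame({
    "Name": ["Braund", np.nan, "peter", np.nan, "Daisy",
             "Manson", "Piston", "Racline", np.nan],
    "Age": [22, 42, np.nan, 56, 22, 48, np.nan, 42, 22],
    "Fare": [2500, 6895, 800, 2500, np.nan, 8564, 5622, np.nan, 4875],
})
summary = missing_summary(df)
print(summary)
```

Checking the dtype up front avoids the try/except blocks and keeps non-numeric columns at NaN for all three statistics.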

Move values in rows in a new column in pandas

I have a DataFrame with an id column and several columns with data, like the column "value" in this example.
For this DataFrame I want to move all the values that correspond to the same id into new columns in a single row, as shown below:
I guess there is an opposite function to "melt" that allows this, but I can't work out how to pivot this DF.
The dicts for the input and output DFs are:
d = {"id":[1,1,1,2,2,3,3,4,5],"value":[12,13,1,22,21,23,53,64,9]}
d2 = {"id":[1,2,3,4,5],"value1":[12,22,23,64,9],"value2":[1,21,53,"","",],"value3":[1,"","","",""]}
Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
df = (df.set_index(['id',df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index())
print (df)
id value0 value1 value2
0 1 12.0 13.0 1.0
1 2 22.0 21.0 NaN
2 3 23.0 53.0 NaN
3 4 64.0 NaN NaN
4 5 9.0 NaN NaN
Missing values can be replaced with fillna, but the result then mixes numbers with strings, so some functions may fail:
df = (df.set_index(['id',df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index()
        .fillna(''))
print (df)
id value0 value1 value2
0 1 12.0 13 1
1 2 22.0 21
2 3 23.0 53
3 4 64.0
4 5 9.0
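If the goal is to keep whole numbers without mixing in empty strings, one option is the nullable Int64 dtype. The sketch below assumes the columns should be numbered from 1 as in the question's target, and the helper column n is introduced only for the pivot.

```python
import pandas as pd

d = {"id": [1, 1, 1, 2, 2, 3, 3, 4, 5],
     "value": [12, 13, 1, 22, 21, 23, 53, 64, 9]}
df = pd.DataFrame(d)

wide = (
    df.assign(n=df.groupby("id").cumcount() + 1)   # position of each value within its id
      .pivot(index="id", columns="n", values="value")
      .add_prefix("value")                          # 1 -> "value1", and so on
      .astype("Int64")                              # nullable ints: gaps show as <NA>
      .reset_index()
      .rename_axis(columns=None)
)
print(wide)
```

The gaps stay recognisable as missing values, so numeric functions keep working on every column.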
You can GroupBy to a list, then expand the series of lists:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
id 0 1 2
0 1 12 13.0 1.0
1 2 22 21.0 NaN
2 3 23 53.0 NaN
3 4 64 NaN NaN
4 5 9 NaN NaN
In general, such operations will be expensive as the number of columns is arbitrary. Pandas / NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
