Manage missing values in a dataframe with string and number columns - python

I have a dataframe with some string columns and some number columns, and I want to manage the missing values.
I want to replace the NaN values with the mean of each row.
I have seen different questions on this website, but they differ from mine, like this one: Pandas Dataframe: Replacing NaN with row average
If all the values in a row are NaN, I want to delete that row. I have also provided a sample case as follows:
import pandas as pd
import numpy as np

# input:
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan, np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan, np.nan]

# desired output (as given in the question):
df = pd.DataFrame()
df['id'] = ['a', 1, 'n']
df['md'] = ['d', 6, 'l']
df['c1'] = [2, 6, 5]
df['c2'] = [0, 5, 3]
df['c3'] = [8, 7, 4]
df
Note:
I have used the following code; however, it is very slow, and for a big dataframe it takes a long time to run.
index_column = df.columns.get_loc("c1")
df_withno_id = df.iloc[:, index_column:]
rowsidx_with_all_NaN = df_withno_id[df_withno_id.isnull().all(axis=1)].index.values
df = df.drop(df.index[rowsidx_with_all_NaN])
for i, cols in df_withno_id.iterrows():
    if i not in rowsidx_with_all_NaN:
        extract_data = list(cols)
        mean = np.nanmean(extract_data)
        df.loc[i] = df.loc[i].replace(np.nan, mean)
Can anybody help me with this? Thanks.

First, you can select only the float column types. Second, for these columns, drop the rows where all values are NaN. Finally, you can transpose the dataframe (float columns only), fill each missing value with the column average of the transpose (which is the row average of the original), and transpose back.
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan, np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan, np.nan]

# float columns only; drop rows where all of them are NaN
numeric_cols = df.select_dtypes(include='float64').columns
df.dropna(how='all', subset=numeric_cols, inplace=True)

# transpose, fill each (transposed) column with its mean = original row mean, transpose back
df[numeric_cols] = df[numeric_cols].T.fillna(df[numeric_cols].T.mean()).T
df
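If you prefer to avoid the double transpose, here is a minimal equivalent sketch (assuming the same df and numeric_cols as above): compute the row means once and let Series.fillna align them by index, column by column.

row_means = df[numeric_cols].mean(axis=1)
# each column is a Series indexed like row_means, so fillna aligns row-wise
df[numeric_cols] = df[numeric_cols].apply(lambda col: col.fillna(row_means))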

insert python list in all rows of a new pd.DataFrame column

I have python list:
my_list = [1, 'V']
I have pd.Dataframe:
A B C
0 f v b
1 f i n
2 f i m
I need to create a new column in my dataframe where every value is my_list:
A B C D
0 f v b [1, 'V']
1 f i n [1, 'V']
2 f i m [1, 'V']
As far as I understand, python lists can be cell values, because df.groupby with apply "list" produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
A B H
0 f i [n, m]
1 f v [b]
Is it possible without converting the type of my_list? What is the easiest way to do that?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but they did not meet my expectations
You can try using np.repeat, with its repeats parameter set to the number of rows, which can be found from the shape of the dataframe. The list has to be wrapped in another list so it is repeated as a whole row rather than element-wise:
import numpy as np
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})
# repeat the whole list once per row; note that numpy coerces the mixed
# types to a common string dtype, so the 1 becomes '1'
df['new_col'] = np.repeat([my_list], df.shape[0], axis=0).tolist()
This repeats the values of my_list as many times as there are rows in the DataFrame, with the string-coercion caveat noted in the comment.
You can do it by creating a new array with my_list through hstack and then forming a new DataFrame. The code below has been tested and works fine.
import numpy as np
import pandas as pd

a1 = np.array([['f', 'v', 'b'], ['f', 'i', 'n'], ['f', 'i', 'm']])
a2 = np.array([1, 'V']).repeat(3).reshape(2, 3).transpose()
df = pd.DataFrame(np.hstack((a1, a2)))
Edit: Another code that has been tested is:
import pandas as pd
import numpy as np

a1 = np.array([['f', 'v', 'b'], ['f', 'i', 'n'], ['f', 'i', 'm']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1, a2)))
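For completeness, the most direct way to put the same list object into every row, without numpy and without any type coercion, is plain list multiplication; a minimal sketch assuming the question's df:

import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'A': ['f', 'f', 'f'], 'B': ['v', 'i', 'i'], 'C': ['b', 'n', 'm']})
# every row references the same list object; the int 1 stays an int
df['D'] = [my_list] * len(df)

If the rows should hold independent copies (so mutating one does not affect the others), use [list(my_list) for _ in range(len(df))] instead.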

Pandas: rolling total of checked out vs checked in items

I have a large data set for which I need to calculate the number of checked-out vs checked-in items.
Sample data below, where rollingTotalCheckedOut describes the expected value: while items are checked out, the number of checked-out items increases; when items are checked back in, it decreases.
df = pd.DataFrame([
    ['A', 1624990605, 1627102404, 1],
    ['A', 1624990635, 1625015061, 2],
    ['A', 1624990790, 1624991096, 3],
    ['A', 1624990790, 1624990913, 4],
    ['A', 1624990822, 1624991711, 5],
    ['A', 1624990945, 1624991096, 5],
    ['A', 1624991036, 1624991066, 6],
    ['A', 1624991067, 1624991188, 6],
], columns=['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])
# some helpers
df['checkoutTime'] = pd.to_datetime(df['out_ts'], unit='s', origin='unix')
df['checkinTime'] = pd.to_datetime(df['in_ts'], unit='s', origin='unix')
I am not even sure how best to describe this problem. What is my strategy here; how should I frame and tackle this problem? A rolling window does not seem suitable because, in this case, the first row is "checked out" for a very long time.
Here is what I got. It does not exactly match your calculation, but I can't immediately see an error; I will check again. Honestly, though, I am not sure why you would have a 5 at the end: the previous period ended, but a new one just started.
import pandas as pd

df = pd.DataFrame([
    ['A', 1624990605, 1627102404, 1],
    ['A', 1624990635, 1625015061, 2],
    ['A', 1624990790, 1624991096, 3],
    ['A', 1624990790, 1624990913, 4],
    ['A', 1624990822, 1624991711, 5],
    ['A', 1624990945, 1624991096, 5],
    ['A', 1624991036, 1624991066, 6],
    ['A', 1624991067, 1624991188, 5],
], columns=['ID', 'out_ts', 'in_ts', 'rollingTotalCheckedOut'])
df["full_interval"] = df["out_ts"].astype("str") + "_" + df["in_ts"].astype("str")
df_out= df.drop(columns = ["in_ts"])
df_out["ts"] = df_out["out_ts"]
df_out["op"] = "out"
df_out["op_val"] = 1
df_in= df.drop(columns = ["out_ts"])
df_in["ts"] = df_in["in_ts"]
df_in["op"] = "in"
df_in["op_val"] = -1
df_stacked = pd.concat([df_out, df_in]).sort_values("ts")
df_stacked["rollingTotalCheckedOut"] = df_stacked["op_val"].cumsum()
df_stacked = df_stacked.sort_values("out_ts").dropna(subset=["out_ts"])
df = df.merge(df_stacked.loc[:,["ID","full_interval", "rollingTotalCheckedOut"]], how="left", on=["ID", "full_interval"])
df = df.drop(columns=["full_interval"])
df
Output:
ID out_ts in_ts rollingTotalCheckedOut_x rollingTotalCheckedOut_y
0 A 1624990605 1627102404 1 1
1 A 1624990635 1625015061 2 2
2 A 1624990790 1624991096 3 3
3 A 1624990790 1624990913 4 4
4 A 1624990822 1624991711 5 5
5 A 1624990945 1624991096 5 5
6 A 1624991036 1624991066 6 6
7 A 1624991067 1624991188 5 6
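The same +1/-1 event idea can be written a bit more compactly; a hedged sketch assuming the question's df (the rollingTotalCheckedOut_calc name is mine): build one event per timestamp, take a cumulative sum, and read the running total back off the checkout events, whose index still matches df.

out_events = df[['out_ts']].rename(columns={'out_ts': 'ts'}).assign(delta=1)
in_events = df[['in_ts']].rename(columns={'in_ts': 'ts'}).assign(delta=-1)
# the stable sort keeps checkouts ahead of check-ins that share a timestamp,
# because the out events were concatenated first
events = pd.concat([out_events, in_events]).sort_values('ts', kind='stable')
events['running'] = events['delta'].cumsum()
# the checkout rows kept their original index, so this aligns back onto df
df['rollingTotalCheckedOut_calc'] = events.loc[events['delta'] == 1, 'running']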

Group the dataframe by a specific id and then plot against another column

I have a dataframe as follows; I have grouped the columns by "specific_id". However, I need to plot the dataframe based on the column "time" (and if the time is sorted, that would be great too).
Here is the dataframe I have,
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['time'] = ['2019-01-07 09:38:30', '2020-01-08 09:38:30',
              '2021-01-07 09:38:30', '2020-01-07 09:38:30']
df['specific_id'] = ['d', 'd', 'f', 'f']
df['c1'] = [2, 3, 7, 5]
df['c2'] = [0, 5, 10, 3]
df
I have grouped the dataframe with the following code:
df_sticked = df.filter(regex=r'c\d+', axis=1) \
    .groupby(df['specific_id']).apply(np.ravel).apply(pd.Series) \
    .rename(lambda x: f"c{x + 2}", axis=1).reset_index().fillna(0)
df_sticked
However, when I plot the data, I cannot show the time on the x axis.
import matplotlib.pyplot as plt
%matplotlib inline
dfforplot = df_sticked.iloc[:, 1:-1]
dfforplot.T.plot(figsize=(20,11), legend=False)
plt.show()
Could you please help me?
Thanks
I simplified the steps in order to plot time by value:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame()
df['time'] = ['2019-01-07 09:38:30', '2020-01-08 09:38:30',
              '2021-01-07 09:38:30', '2020-01-07 09:38:30']
df['specific_id'] = ['d', 'd', 'f', 'f']
df['c1'] = [2, 3, 7, 5]
df['c2'] = [0, 5, 10, 3]

# collect the c-columns into one list column, then explode to long format
df['c'] = df.filter(regex=r'c\d+', axis=1).values.tolist()
df = df[['specific_id', 'time', 'c']]
df = df.explode('c')

# one line per specific_id, with time on the x axis
items = [x for _, x in df.groupby('specific_id')]
for item in items:
    plt.plot(item['time'].values,
             item['c'].values,
             label=item['specific_id'].iloc[0])
plt.legend(loc="upper left")
plt.show()
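Since the question also asks for the time axis to be sorted, here is a small variation on the loop above (assuming the exploded df from that code): parse the timestamps with pd.to_datetime and sort before plotting, so matplotlib spaces the points chronologically.

df['time'] = pd.to_datetime(df['time'])
for sid, item in df.sort_values('time').groupby('specific_id'):
    # cast the exploded object column to float so matplotlib treats it as numeric
    plt.plot(item['time'], item['c'].astype(float), label=sid)
plt.legend(loc="upper left")
plt.show()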

Count and sort pandas dataframe

I have a dataframe with a column 'code', which I have sorted based on frequency.
In order to see what each code means, there is also a column 'note'.
For each group of the 'code' column, I display the count and the first note attached to that code:
df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
Now my question is, how do I display only those rows that have frequency of e.g. >= 30?
Add a query call before you sort. Also, if you only want the rows EQUALing <insert frequency here>, sort_values isn't needed (right?!).
df.groupby('code')['note'].agg(['count', 'first']).query('count == 30')
If the question is for all groups with AT LEAST < insert frequency here >, then
(
    df.groupby('code')
      .note.agg(['count', 'first'])
      .query('count >= 30')
      .sort_values('count', ascending=False)
)
Why do I use query? It's a lot easier to pipe and chain with it.
You can just filter your result accordingly:
grp = grp[grp['count'] >= 30]
Example with data
import pandas as pd
df = pd.DataFrame({'code': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
                   'note': ['A', 'B', 'A', 'A', 'C', 'C', 'C', 'A', 'A',
                            'B', 'B', 'C', 'A', 'B']})
res = df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
# count first
# code
# 2 5 C
# 3 5 B
# 1 4 A
res2 = res[res['count'] >= 5]
# count first
# code
# 2 5 C
# 3 5 B
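Another option, as a matter of taste: GroupBy.filter keeps the raw rows belonging to sufficiently frequent codes before aggregating, which is handy if you still need the original rows as well. With the example df above and the same threshold of 5:

frequent = df.groupby('code').filter(lambda g: len(g) >= 5)
frequent.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
#       count first
# code
# 2         5     C
# 3         5     B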

pandas three-way joining multiple dataframes on columns

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?
The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name in your example, I'd do the following:
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
You could try this if you have 3 dataframes:
# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
alternatively, as mentioned by cwharland
df1.merge(df2,on='name').merge(df3,on='name')
This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4', ...]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
With #zero's data, you could do this:
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
attr11 attr12 attr21 attr22 attr31 attr32
name
a 5 9 5 19 15 49
b 4 61 14 16 4 36
c 24 9 4 9 14 9
In python 3.6.3 with pandas 0.22.0 you can also use concat as long as you set as index the columns you want to use for the joining:
pd.concat(
    objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
    axis=1,
    join='inner'
).reset_index()
where df1, df2, and df3 are defined as in John Galt's answer:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')
or if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
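For what it's worth, functools.reduce from the earlier answer covers both cases, since reduce consumes any iterable (list or generator) and uses the first element as the starting value automatically; assuming the same df_list and join_col_name:

import functools as ft
# works whether df_list is a list or a generator
df = ft.reduce(lambda left, right: left.merge(right, on='join_col_name'), df_list)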
Simple Solution:
If the joining column names are the same:
df1.merge(df2,on='col_name').merge(df3,on='col_name')
If the column names are different:
(df1.merge(df2, left_on='col_name1', right_on='col_name2')
    .merge(df3, left_on='col_name1', right_on='col_name3')
    .drop(columns=['col_name2', 'col_name3'])
    .rename(columns={'col_name1': 'col_name'}))
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. It also fills in missing values if needed.
This is the function to merge a dict of data frames
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    keys = list(dfDict.keys())  # list() so the keys are indexable in Python 3
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in onCols, cols))
        df0 = df0[onCols + valueCols]
        # suffix each value column with its dict key
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]
        if i == 0:
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)
    if naFill is not None:
        outDf = outDf.fillna(naFill)
    return outDf
OK, let's generate data and test this:
def GenDf(size):
    df = pd.DataFrame({
        'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
        'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
        'col1': np.random.uniform(low=0.0, high=100.0, size=size),
        'col2': np.random.uniform(low=0.0, high=100.0, size=size),
    })
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return df

size = 5
dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
You do not need a multiindex to perform join operations.
You just need to set the index column on which the join will be performed correctly (with the command df.set_index('Name'), for example).
The join operation is performed on the index by default.
In your case, you just have to specify that the Name column corresponds to your index.
Below is an example
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia', 'Emma', 'Isabella', 'Olivia', 'Ava', 'Emily', 'Abigail', 'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Set the column 'Name' as the index
df1 = df1.set_index('Name')
# If the indexes are different, one may have to play with the how parameter
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
There is another solution from the pandas documentation (that I don't see here),
using the .append
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
A B
0 1 2
1 3 4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
A B
0 5 6
1 7 8
>>> df.append(df2, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.
If there are different column names, NaN values will be introduced.
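Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the pd.concat equivalent of the call above is:

pd.concat([df, df2], ignore_index=True)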
I tweaked the accepted answer to perform the operation for multiple dataframes with different suffixes parameters, using reduce, and I guess it can be extended to different on parameters as well.
from functools import reduce

dfs_with_suffixes = [(df2, suffix2), (df3, suffix3), (df4, suffix4)]

merge_one = lambda x, y, sfx: pd.merge(x, y, on=['col1', 'col2', ...], suffixes=sfx)

merged = reduce(lambda left, right: merge_one(left, *right), dfs_with_suffixes, df1)
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['d', 14, 16]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['d', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr41', 'attr42']
)
Three ways to join a list of dataframes
pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# cannot run if the index is not unique
dfs = pd.concat(dfs, join='outer', axis=1)
functools.reduce
dfs = [df1, df2, df3, df4]
# still runs if the index is not unique
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)
join
# cannot run if the index is not unique
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how='outer')
Joining together all three can be done using the .join() function.
Say you have three DataFrames: df1, df2, df3.
To join these into one DataFrame you can:
df = df1.join(df2).join(df3)
This is the simplest way I found to do this task. Note that .join aligns on the index, so you would typically set_index('name') on each DataFrame first; with overlapping column names you also need the lsuffix/rsuffix parameters.
