Why are NaN values shown after reindexing? - python

I am trying to reindex the rows of a DataFrame, but the result shows NaN values and I can't understand why:
import pandas as pd

data = {
    "age": [50, 40, 30, 40],
    "qualified": [True, False, False, False]
}
index = ["P", "Q", "R", "S"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C", "D"]
newdf = df.reindex(new)
print(newdf)
Output:
   age qualified
A  NaN       NaN
B  NaN       NaN
C  NaN       NaN
D  NaN       NaN

I think you need DataFrame.set_index with a nested list if you want to replace the index values with new ones:
new = ["A", "B", "C", "D"]
newdf = df.set_index([new])
#alternative
#newdf.index = new
print(newdf)
   age  qualified
A   50       True
B   40      False
C   30      False
D   40      False
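Another sketch of the same fix, assuming the df from the question: DataFrame.rename takes an explicit old-to-new mapping instead of relying on positional order:
# hypothetical alternative: map old labels to new labels explicitly
newdf = df.rename(index=dict(zip(df.index, new)))
print(newdf)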
DataFrame.reindex works differently: it builds a new index from the list and aligns the data to it. That means it first matches the existing index values against the values of the new list new, and for labels with no match it creates NaNs:
data = {
    "age": [50, 40, 30, 40],
    "qualified": [True, False, False, False]
}
index = ["A", "Q", "D", "C"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C"]
newdf = df.reindex(new)
print(newdf)
    age qualified
A  50.0      True
B   NaN       NaN
C  40.0     False
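If the NaNs themselves are the problem rather than the alignment, note that reindex also accepts a fill_value; a minimal sketch with the frame above:
# missing labels get the fill_value instead of NaN
newdf = df.reindex(["A", "B", "C"], fill_value=0)
print(newdf)
#    age qualified
# A   50      True
# B    0         0
# C   40     False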

Pandas new col with indexes of rows sharing a code in another col

Let's say I have a DataFrame indexed on a unique Code. Each entry may inherit from another (unique) entry: the parent's Code is given in the Herit column.
I need a new column giving the list of children for every entry. I can obtain the list for a given Code, but I don't succeed in building the whole column.
Here is my M(non)WE:
import pandas as pd

data = pd.DataFrame({
    "Code": ["a", "aa", "ab", "b", "ba", "c"],
    "Herit": ["", "a", "a", "", "b", ""],
    "C": [12, 15, 13, 12, 14, 10]
})
data.set_index("Code", inplace=True)
print(data)
child_a = data[data.Herit == "a"].index.values
print(child_a)
# this is the part that does not work:
data["child"] = data.apply(lambda x: data[data.Herit == x.index].index.values, axis=1)
print(data)
You can group by the Herit column and then reduce the corresponding Codes into lists (here df is the question's frame before Code is set as the index):
>>> herits = df.groupby("Herit").Code.agg(list)
>>> herits
Herit
       [a, b, c]
a       [aa, ab]
b           [ba]
Name: Code, dtype: object
Then you can map the Code column of your frame with this, assign it to a new column, and fill the slots that don't have any children with "":
>>> df["Children"] = df.Code.map(herits).fillna("")
>>> df
  Code Herit   C  Children
0    a        12  [aa, ab]
1   aa     a  15
2   ab     a  13
3    b        12      [ba]
4   ba     b  14
5    c        10
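If you keep Code as the index, as in the question's setup, here is a sketch of the same idea using the question's variable names:
# collect each parent's children, then map the index through that Series
children = data.groupby("Herit").apply(lambda g: list(g.index))
data["child"] = data.index.to_series().map(children).fillna("")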

How to replace strings with values from one dataframe in another dataframe

I have two dataframes: in one dataframe (df1) each user has string values, while in the other dataframe (df2) there is a value associated with each string value.
I want a new dataframe similar to df1 but with the strings replaced by the corresponding values from df2. Let me know if a simple method exists to create such a dataframe.
Here is the sample data for df1 and df2:
import pandas as pd

df1 = pd.DataFrame({"user": ["user1", "user2", "user3", "user4"],
                    "p1": ["A", "C", "D", "D"],
                    "p2": ["B", "D", "D", "A"],
                    "p3": ["A", "B", "C", "D"],
                    "p4": ["D", "A", "B", "C"]}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({"N1": ["A", "B", "C", "D"],
                    "N2": ["1", "2", "5", "6"]}, index=[0, 1, 2, 3])
My desired output is df1 with every letter replaced by its corresponding N2 value from df2.
You can use df.stack() with Series.map and df.unstack:
In [95]: df3 = df1.set_index('user').stack().map(df2.set_index('N1')['N2']).unstack()
In [96]: df3
Out[96]:
      p1 p2 p3 p4
user
user1  1  2  1  6
user2  5  6  2  1
user3  6  6  5  2
user4  6  1  6  5
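An alternative sketch, assuming df2 covers every letter that appears in df1: build a dict and let DataFrame.replace do the substitution:
# build a lookup dict from df2 and replace values across all columns
mapping = dict(zip(df2["N1"], df2["N2"]))
df3 = df1.set_index("user").replace(mapping)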

equivalent python and pandas operation for group_by + mutate + indexing column vectors within mutate in R

Sample data frame in Python:
d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
Now I want to get the same output in Python with pandas as I get in R with the code below, i.e. the percentage change in col2 by group in col1.
data.frame(col1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
           col2 = c(3, 4, 5, 1, 3, 9, 16, 18, 23)) -> df
df %>%
  dplyr::group_by(col1) %>%
  dplyr::mutate(perc = (dplyr::last(col2) - col2[1]) / col2[1])
In python, I tried:
def perc_change(column):
    index_1 = tu_in[column].iloc[0]
    index_2 = tu_in[column].iloc[-1]
    perc_change = (index_2 - index_1) / index_1
    return perc_change

d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
df.assign(perc_change = lambda x: x.groupby["col1"]["col2"].transform(perc_change))
But it gives me an error saying: 'method' object is not subscriptable.
I am new to python and trying to convert some R code into python. How can I solve this in an elegant way? Thank you!
You don't want transform here. transform is typically used when your aggregation returns a scalar value per group and you want to broadcast that result to all rows of the group in the original DataFrame. Because GroupBy.pct_change already returns a result indexed like the original, you can compute it and assign it back directly.
df['perc_change'] = df.groupby('col1')['col2'].pct_change()
#  col1  col2  perc_change
#0    a     3          NaN
#1    a     4     0.333333
#2    a     5     0.250000
#3    b     1          NaN
#4    b     3     2.000000
#5    b     9     2.000000
#6    c     5          NaN
#7    c     7     0.400000
#8    c    23     2.285714
But if instead what you need is the overall percentage change within a group, i.e. the difference between the last and first value divided by the first value, then you do want transform.
df.groupby('col1')['col2'].transform(lambda x: (x.iloc[-1] - x.iloc[0])/x.iloc[0])
0 0.666667
1 0.666667
2 0.666667
3 8.000000
4 8.000000
5 8.000000
6 3.600000
7 3.600000
8 3.600000
Name: col2, dtype: float64
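To mirror the R mutate output and keep the result as a column, assign the transform result back; a minimal sketch:
df['perc'] = df.groupby('col1')['col2'].transform(
    lambda x: (x.iloc[-1] - x.iloc[0]) / x.iloc[0]
)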

Check a condition in the cells of a column and return a value if the condition is fulfilled, using lambda (PYTHON)

Imagine the following dataframe:
import pandas as pd

data = pd.DataFrame({"col1": ["a", "b", "z", "w", "g", "p", "f"],
                     "col2": ["010", "030", "500", "333", "090", "050", "111"]})
I want to use a lambda function to remove the leading 0 from the cells in col2.
What I have tried is:
data["col2"].apply(lambda row: row["col2"][1:] if row["col2"][0:1] == "0" else row["col2"])
But it is not working; it returns the following error:
TypeError: string indices must be integers
So col2 should appear like 10, 30, 500, 333, 90, 50, 111
There is no need to index with 'col2': apply on a Series passes each cell value to the lambda, not a row, so use the value directly:
data["col2"].apply(lambda row: row[1:] if row[0:1] == "0" else row)
You can also try regex in Python:
import re
import pandas as pd

data = pd.DataFrame({"col1": ["a", "b", "z", "w", "g", "p", "f"],
                     "col2": ["010", "030", "500", "333", "090", "050", "111"]})
data['col2'] = data['col2'].apply(lambda x: re.sub(r"^0", '', x))
output:
  col1 col2
0    a   10
1    b   30
2    z  500
3    w  333
4    g   90
5    p   50
6    f  111
to_numeric() converts its argument to a numeric type; astype() changes the data type of a Series.
Example:
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "z", "w", "g", "p", "f"],
                   "col2": ["010", "030", "500", "333", "090", "050", "111"]})
df.col2 = pd.to_numeric(df.col2, errors='coerce').astype(str)
# or
# df.col2 = df.col2.astype(int).astype(str)
print(df)
Output:
  col1 col2
0    a   10
1    b   30
2    z  500
3    w  333
4    g   90
5    p   50
6    f  111
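A vectorized sketch that avoids apply entirely, assuming the same data frame: the pandas string methods accept the same anchored regex:
# remove a single leading zero across the whole column
data["col2"] = data["col2"].str.replace(r"^0", "", regex=True)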

pandas groupby - custom function

I have the following dataframe, to which I apply groupby and sum():
import numpy as np
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
     'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1").sum()
This results in the following:
      col2
col1
A      6.0
B     15.0
C      0.0
I want C to show NaN instead of 0 since all of the values for C are NaN. How can I accomplish this? Apply() with a lambda function? Any help would be appreciated.
Use this:
df.groupby('col1').apply(pd.DataFrame.sum, skipna=False).reset_index(drop=True)
# Or --> df.groupby('col1', as_index=False).apply(pd.DataFrame.sum, skipna=False)
Output (note that applying DataFrame.sum to each sub-frame also sums, i.e. concatenates, the strings in col1, hence AAA):
  col1  col2
0  AAA   6.0
1  BBB  15.0
2  CCC   NaN
Without the apply(), thanks to @piRSquared and @Alollz: with min_count=1, the sum of a group that merely contains a NaN is still returned, and only all-NaN groups yield NaN:
df.set_index('col1').sum(level=0, min_count=1).reset_index()
Thanks to @piRSquared, @Alollz, and @anky_91, you can also do it without setting and resetting the index:
import numpy as np
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
     'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1", as_index=False).sum(min_count=1)
Output:
  col1  col2
0    A   6.0
1    B  15.0
2    C   NaN
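A small sketch of what min_count=1 changes, using hypothetical data where one group is only partly NaN: the partial group still gets a sum, and only the all-NaN group yields NaN.
d2 = {'col1': ['A', 'A', 'B', 'B'], 'col2': [1, np.nan, np.nan, np.nan]}
print(pd.DataFrame(d2).groupby('col1', as_index=False).sum(min_count=1))
#   col1  col2
# 0    A   1.0
# 1    B   NaN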
Make the call to sum use the parameter skipna=False.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
That link should provide the documentation you need, and I expect it will fix your problem.
