updating sub-set dataframe to update parent dataframe - python

I have a 4x4 dataframe (df). From it, as a parent, I created two child dataframes: dfA with a single column (4x1) and dfB with two columns (4x2). Both subsets contain NaN values. When I use fillna on dfA and dfB, I can see the NaN values replaced with the given value in each child. Fine up to now. However, when I then check the parent dataframe, the updated values are reflected in the first case (4x1) but not in the second (4x2). Why is that, and what should I do so that changes in a child dataframe are reflected in the parent dataframe?
import numpy as np
import pandas as pd

studentnames = ['Maths', 'English', 'Soc.Sci', 'Hindi', 'Science']
semisteronemarks = [15, 50, np.nan, 50, np.nan]
semistertwomarks = [25, 53, 45, 45, 54]
semisterthreemarks = [20, 50, 45, 15, 38]
semisterfourmarks = [26, 33, np.nan, 35, 34]
semisters = ['Rakesh', 'Rohit', 'Sam', 'Sunil']
df1 = pd.DataFrame([semisteronemarks, semistertwomarks, semisterthreemarks, semisterfourmarks],
                   index=semisters, columns=studentnames)
# case 1
dfA = df['Soc.Sci']
dfA.fillna(value = 98, inplace = True)
print(dfA)
print(df)
# case 2
dfB = df[['Soc.Sci', 'Science']]
dfB.fillna(value = 99, inplace = True)
print(dfB)
print(df)
## Actual output (contents of parent df):
# case 1
        Maths  English  Soc.Sci  Hindi  Science
Rakesh     15       50     98.0     50      NaN
Rohit      25       53     45.0     45     54.0
Sam        20       50     45.0     15     38.0
Sunil      26       33     98.0     35     34.0
# case 2
        Maths  English  Soc.Sci  Hindi  Science
Rakesh     15       50      NaN     50      NaN
Rohit      25       53     45.0     45     54.0
Sam        20       50     45.0     15     38.0
Sunil      26       33      NaN     35     34.0
## Expected output:
# case 1
        Maths  English  Soc.Sci  Hindi  Science
Rakesh     15       50     98.0     50      NaN
Rohit      25       53     45.0     45     54.0
Sam        20       50     45.0     15     38.0
Sunil      26       33     98.0     35     34.0
# case 2
        Maths  English  Soc.Sci  Hindi  Science
Rakesh     15       50     99.0     50      NaN
Rohit      25       53     45.0     45     54.0
Sam        20       50     45.0     15     38.0
Sunil      26       33     99.0     35     34.0
# note the difference in output for column Soc.Sci in case 2.

In your code df1 is defined, but df is not. With the approach you are using:
# case 1
dfA = df1['Soc.Sci']               # changed df to df1; single brackets give a Series
dfA.fillna(value=98, inplace=True)
df1['Soc.Sci'] = dfA               # assign the filled Series back to the parent
# if you want to assign with
df1['Soc.Sci'] = dfA['Soc.Sci']
# you need to select dfA as a DataFrame instead:
dfA = df1[['Soc.Sci']]             # double brackets make it a DataFrame
# case 2
dfB = df1[['Soc.Sci', 'Science']]  # changed df to df1
dfB.fillna(value=99, inplace=True)
df1[['Soc.Sci', 'Science']] = dfB[['Soc.Sci', 'Science']]
print(df1)
I would suggest just using fillna on the parent df directly:
df1['Soc.Sci'].fillna(value=99, inplace=True)
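If the two columns need different fill values, fillna also accepts a column-to-value dict, so both cases can be handled in a single call on the parent; a minimal sketch:
# fill each column's NaNs with its own value, directly on the parent frame
df1.fillna({'Soc.Sci': 98, 'Science': 99}, inplace=True)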

You should have seen a warning:
Warning (from warnings module):
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
It means that dfB may be a copy instead of a view, and according to the results it is. There is little that can be done here; specifically, you cannot force pandas to give you a view. The choice depends on internals known only to pandas and its developers.
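While debugging, pandas versions that still emit this warning can be told to escalate it to an error, so every chained-assignment site fails loudly instead of silently not updating; a minimal sketch using the documented option:
pd.set_option('mode.chained_assignment', 'raise')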
But it is always possible to assign to the columns of the parent DataFrame:
# case 2
df = pd.DataFrame([semisteronemarks,semistertwomarks,semisterthreemarks,semisterfourmarks],semisters, studentnames)
df[['Soc.Sci', 'Science']] = df[['Soc.Sci', 'Science']].fillna(value = 99)
print(df)

Related

Move the 4th column in each row to next row in python

I have a CSV file with 4 columns and n lines. I want the 4th column value of each row to move onto the following line each time, for example:
[LN],[cb],[I], [LS]
to
[LN],[cb],[I]
[LS]
that is, if my file is:
[LN1],[cb1],[I1], [LS1]
[LN2],[cb2],[I2], [LS2]
[LN3],[cb2],[I3], [LS3]
[LN4],[cb4],[I4], [LS4]
the output file will look like
[LN1],[cb1],[I1]
[LS1]
[LN2],[cb2],[I2]
[LS2]
[LN3],[cb2],[I3]
[LS3]
[LN4],[cb4],[I4]
[LS4]
Test file:
101 Xavier Mexico City 41 88.0
102 Ann Toronto 28 79.0
103 Jana Prague 33 81.0
104 Yi Shanghai 34 80.0
105 Robin Manchester 38 68.0
Output required:
101 Xavier Mexico City 41
88.0
102 Ann Toronto 28
79.0
103 Jana Prague 33
81.0
104 Yi Shanghai 34
80.0
105 Robin Manchester 38
68.0
Split the dataframe into two dataframes, one with the first 3 columns and the other with the last column. Add a helper column to both so you can order them afterwards. Then combine them again and order them first by index (which is identical for entries that were previously in the same row) and then by the helper column.
Since there is no machine-readable test data, this answer is untested:
from io import StringIO
import pandas as pd
s = """col1,col2,col3,col4
101 Xavier,Mexico City,41,88.0
102 Ann,Toronto,28,79.0
103 Jana,Prague,33,81.0
104 Yi,Shanghai,34,80.0
105 Robin,Manchester,38,68.0"""
df = pd.read_csv(StringIO(s), sep=',')
df1 = df[['col1', 'col2', 'col3']].copy()
df2 = df[['col4']].rename(columns={'col4':'col1'}).copy()
df1['ranking'] = 1
df2['ranking'] = 2
df_out = pd.concat([df1, df2])  # DataFrame.append was deprecated and later removed, so use pd.concat
df_out = df_out.rename_axis('index_name').sort_values(by=['index_name', 'ranking'], ascending=[True, True])
df_out = df_out.drop(['ranking'], axis=1)
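If the result then has to go back into a file, a plain to_csv call should do; the file name below is just a placeholder, and note that the NaN cells left in col2/col3 by the reshape come out as empty fields:
df_out.to_csv('interleaved.csv', index=False, header=False)  # hypothetical path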
Another solution to this is to convert the table to a list, then rearrange the list to reconstruct the table.
import pandas as pd
df = pd.read_csv(r"test_file.csv")
df_list = df.values.tolist()
new_list = []
for x in df_list:
    # pop() removes the last element and returns it;
    # this also modifies the original row list in place
    last_x = x.pop()
    # append the shortened row and the popped value as its own row
    new_list.append(x)
    new_list.append([last_x])
# use the new list to create the new table
new_df = pd.DataFrame(new_list)

Adding calculated row in Pandas

gender  math score  reading score  writing score
female          65             73             74
male            69             66             64
Given the dataframe (see above), how can we add a line that calculates the difference between the row values in the following way:
gender      math score  reading score  writing score
female              65             73             74
male                69             66             64
Difference          -4              7             10
Or is there a more convenient way of expressing the difference between the rows?
Thank you in advance.
Let:
df = pd.DataFrame({"A": [5, 10], "B": [9, 8], "gender": ["female", "male"]}).set_index("gender")
df.loc['Difference'] = df.apply(lambda x: x["female"] - x["male"])
In a one-liner with .loc[] and .diff() (diff(-1) computes each row minus the following row, so the surviving first row holds female minus male):
df.loc['Difference'] = df.diff(-1).dropna().values.tolist()[0]
Another idea would be to work with a transposed dataframe and then transpose it back:
import pandas as pd
df = pd.DataFrame({'gender':['male','female'],'math score':[65,69],'reading score':[73,66],'writing score':[74,64]}).set_index('gender')
df = df.T
df['Difference'] = df.diff(axis=1)['female'].values
df = df.T
Output:
            math score  reading score  writing score
gender
male               65.0           73.0           74.0
female             69.0           66.0           64.0
Difference          4.0           -7.0          -10.0
You can calculate the diff by selecting each row and then subtracting. But as you've correctly guessed, that is not the best way to do this. A more convenient way would be to transpose the df and then do subtraction:
import pandas as pd
df = pd.DataFrame([[65, 73, 74], [69, 66, 64]],
index=['female', 'male'],
columns=['math score', 'reading score', 'writing score'])
df_ = df.T
df_['Difference'] = df_['female'] - df_['male']
This is what you get:
               female  male  Difference
math score         65    69          -4
reading score      73    66           7
writing score      74    64          10
If you want, you can transpose it again with df_.T to revert to its initial form.
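For completeness, the same difference row can also be added without any transposing, assuming the index labels are 'female' and 'male' as above:
# row arithmetic with .loc; the two Series align on the column labels
df.loc['Difference'] = df.loc['female'] - df.loc['male']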

Skipping the row if there are more than 2 fields are empty

First, skip a row of data if it has more than 2 empty columns. After this step, the rows with more than 2 missing values will be filtered out.
Then, since some rows still have 1 or 2 empty columns, fill each remaining empty cell with the mean value of its row.
I can run the second step with my code below; however, I am not sure how to filter out the rows with more than 2 missing values.
I have tried using dropna but it deleted all the columns of the table.
My code:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pp
%matplotlib inline
# high technology exports, percentage of manufactured exports
hightech_export = pd.read_csv('hightech_export_1.csv')
# skip a row of data if it has more than 2 empty columns
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
# Fill in data with mean value.
m = hightech_export.mean(axis=1)
for i, col in enumerate(hightech_export):
    hightech_export.iloc[:, i] = hightech_export.iloc[:, i].fillna(m)
My dataset:
Country Name  2001  2002  2003  2004
Philippines     71
Malta           62    58    60    58
Singapore       60                56
Malaysia        58    57          55
Ireland         47    41    34    34
Georgia         38    41    24    38
Costa Rica
You can make use of .isnull() method for doing your first task.
Replace this:
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
with:
hightech_export = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2]
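As a quick sanity check of what that mask does, here is a minimal sketch on a made-up frame:
import numpy as np
import pandas as pd
toy = pd.DataFrame({'a': [1, np.nan, np.nan],
                    'b': [2, np.nan, 5],
                    'c': [3, np.nan, np.nan]})
# rows 0 (no NaNs) and 2 (two NaNs) survive; row 1 (three NaNs) is dropped
print(toy.loc[toy.isnull().sum(axis=1) <= 2])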
Ok try this ...
import pandas as pd
import numpy as np
data1={'Name':['Tom',np.NaN,'Mary','Jane'],'Age':[20,np.NaN,40,30],'Pay':[np.NaN,np.NaN,20,25]}
data2={'Name':['Tom','Bob','Mary'],'Age':[40,30,20]}
df1=pd.DataFrame.from_records(data1)
Check the df
df1
    Age  Name   Pay
0  20.0   Tom   NaN
1   NaN   NaN   NaN
2  40.0  Mary  20.0
3  30.0  Jane  25.0
record with index 1 has 3 missing values...
Replace the missing values with None:
df1 = df1.replace({np.nan: None})  # pd.np was removed from pandas, use the numpy import directly
Now write a function to count the missing values per row, and build a list:
def count_na(lst):
    # test "is None" rather than falsiness, so a legitimate 0 is not counted as missing
    missing = [n for n in lst if n is None]
    return len(missing)

missing_data = []
for index, n in df1.iterrows():
    missing_data.append(count_na(list(n)))
Use this list as a new column in the dataframe:
df1['missing'] = missing_data
df1 should now look like this:
    Age  Name   Pay  missing
0    20   Tom  None        1
1  None  None  None        3
2    40  Mary    20        0
3    30  Jane    25        0
So filtering becomes easy:
# now only take records with at most 2 missing values
df1[df1.missing <= 2]
Hope that helps...
A simple way is to compare, row by row, the count of non-NaN values against the number of columns of the dataframe. You can then just replace NaN with the column means of the dataframe.
Code could be:
result = df.loc[df.apply(lambda x: x.count(), axis=1) >= (len(df.columns) - 2)].replace(
np.nan, df.agg('mean'))
With your example data, it gives as expected:
  Country Name  2001   2002       2003  2004
1        Malta  62.0  58.00  60.000000  58.0
2    Singapore  60.0  49.25  39.333333  56.0
3     Malaysia  58.0  57.00  39.333333  55.0
4      Ireland  47.0  41.00  34.000000  34.0
5      Georgia  38.0  41.00  24.000000  38.0
Try this
hightech_export.dropna(thresh=2, inplace=True)
in place of the line of code
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
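Note that thresh is the minimum number of non-NaN values a row must have to be kept, so if the requirement is "at most 2 missing", it is safer to derive it from the column count; a sketch:
# keep rows with at most 2 missing values, whatever the number of columns
hightech_export.dropna(thresh=len(hightech_export.columns) - 2, inplace=True)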

How to find the index of a value by row in a dataframe in python and extract the value of the following column

I have the following dataframe using pandas
df = pd.DataFrame({'Last_Name': ['Smith', None, 'Brown'],
'Date0': ['01/01/1999','01/06/1999','01/01/1979'], 'Age0': [29,44,21],
'Date1': ['08/01/1999','07/01/2014','01/01/2016'],'Age1': [35, 45, 47],
'Date2': [None,'01/06/2035','08/01/1979'],'Age2': [47, None, 74],
'Last_age': [47,45,74]})
I would like to add a new column containing the date that corresponds to the value present in 'Last_age' for each row, to get something like this:
df = pd.DataFrame({'Last_Name': ['Smith', None, 'Brown'],
'Date0': ['01/01/1999','01/06/1999','01/01/1979'], 'Age0': [29,44,21],
'Date1': ['08/01/1999','07/01/2014','01/01/2016'],'Age1': [35, 45, 47],
'Date2': [None,'01/06/2035','08/01/1979'],'Age2': [47, None, 74],
'Last_age': [47,45,74],
'Last_age_date': ['Error no date','07/01/2014','08/01/1979']})
I will just use wide_to_long to reshape your df:
s = pd.wide_to_long(df.reset_index(), ['Date', 'Age'], i=['Last_age', 'index'], j='Drop')
s.loc[s.Age == s.index.get_level_values(0), 'Date']
Out[199]:
Last_age  index  Drop
47        0      2             None
45        1      1       07/01/2014
74        2      2       08/01/1979
Name: Date, dtype: object
df['Last_age_date'] = s.loc[s.Age == s.index.get_level_values(0), 'Date'].values
df
Out[201]:
  Last_Name       Date0  Age0  ...  Age2  Last_age  Last_age_date
0     Smith  01/01/1999    29  ...  47.0        47           None
1      None  01/06/1999    44  ...   NaN        45     07/01/2014
2     Brown  01/01/1979    21  ...  74.0        74     08/01/1979
[3 rows x 9 columns]
Something like this should do what you are looking for:
import numpy as np

# get the age and date columns (you might have more than just these two of each),
# reversed so the search starts from the last age/date pair
age_columns = [c for c in df.columns if 'Age' in c][::-1]
date_columns = [c for c in df.columns if 'Date' in c][::-1]

def get_last_age_date(row):
    for age, date in zip(age_columns, date_columns):
        if not np.isnan(row[age]):
            return row[date]
    return np.nan

# apply the function to all the rows in the dataframe
df['Last_age_date'] = df.apply(lambda row: get_last_age_date(row), axis=1)
# fix the NaN values to say 'Error no date'
df.Last_age_date.where(~df.Last_age_date.isna(), 'Error no date', inplace=True)
print(df)
Welcome to Stackoverflow! You can write a small function and achieve this. Your input dataframe looks like this.
df
  Last_Name       Date0  Age0       Date1  Age1       Date2  Age2  Last_age
0     Smith  01/01/1999    29  08/01/1999    35        None  47.0        47
1      None  01/06/1999    44  07/01/2014    45  01/06/2035   NaN        45
2     Brown  01/01/1979    21  01/01/2016    47  08/01/1979  74.0        74
Write a function like this:
def last_Age(row):
    if row['Last_age'] == row['Age2']:
        return row['Date2']
    elif row['Last_age'] == row['Age1']:
        return row['Date1']
    elif row['Last_age'] == row['Age0']:
        return row['Date0']

df['Last_age_date'] = df.apply(last_Age, axis=1)
df
  Last_Name       Date0  Age0       Date1  Age1       Date2  Age2  Last_age  Last_age_date
0     Smith  01/01/1999    29  08/01/1999    35        None  47.0        47           None
1      None  01/06/1999    44  07/01/2014    45  01/06/2035   NaN        45     07/01/2014
2     Brown  01/01/1979    21  01/01/2016    47  08/01/1979  74.0        74     08/01/1979

Add 2 dataframes with no exact index values

I have this df:
df1
      t0
ACER  50
BBV   20
ACC   75
IRAL  25
DECO  58
and on the other side:
df2
      t1
ACER  50
BBV   20
CEL    0
DIA   25
I am looking to add both dfs to obtain the following output:
df3
       t2
ACER  100
BBV    40
ACC    75
IRAL   25
DECO   58
DIA    25
CEL     0
Basically it is the addition of the values on the common indices of df1 and df2, also keeping those that did not appear in df1.
I tried
df1.add(df2)
but NaN values arose. I also thought of merging the dfs, substituting NaNs with zeros and adding the columns, but that seems like a lot of code for this task.
Use concat and sum:
df3 = pd.concat([df1, df2], axis=1).sum(axis=1)
ACC      75.0
ACER    100.0
BBV      40.0
CEL       0.0
DECO     58.0
DIA      25.0
IRAL     25.0
You were actually on the right track, you just needed fill_value=0:
df1.t0.add(df2.t1, fill_value=0).to_frame('t2')
         t2
ACC    75.0
ACER  100.0
BBV    40.0
CEL     0.0
DECO   58.0
DIA    25.0
IRAL   25.0
Note here that you'd have to add the Series objects together to prevent misalignment issues.
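To make the alignment point concrete: the two frames carry different column names (t0 vs t1), so a frame-level add pairs nothing up; renaming one column first is an equivalent sketch:
# align the column names first, then add; fill_value covers one-sided index labels
df3 = df1.add(df2.rename(columns={'t1': 't0'}), fill_value=0).rename(columns={'t0': 't2'})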
