I have a dataframe with two fields, name and alias. If one or more aliases appear, I must create a new row for each alias, with the name field replaced by the alias. I have something like the first table, and it should look like the second table, using pandas in Python. Thanks in advance.
IIUC, you want to add rows to get all possible names and aliases as rows.
You could get all unique names as a set and reindex:
out = (
    df.set_index('name')
      .reindex(set(df['name']).union(*df['alias']))
      .reset_index()
)
Output:
name alias
0 Juan [Perez, Juancho]
1 Perez NaN
2 Juancho NaN
Or transform "alias" to a renamed DataFrame and concat:
out = (
    pd.concat([df,
               df['alias'].explode().rename('name').to_frame()
               ])
      .sort_index(kind='stable')
)
Output:
name alias
0 Juan [Perez, Juancho]
0 Perez NaN
0 Juancho NaN
Used input:
df = pd.DataFrame({'name': ['Juan'],
                   'alias': [['Perez', 'Juancho']]})
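Putting the second approach together as a self-contained, runnable sketch on the sample input (the `reset_index(drop=True)` at the end is an addition here, to get a clean 0..n index):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Juan'],
                   'alias': [['Perez', 'Juancho']]})

# Explode the alias lists into one row per alias, treat each alias as a
# name, and append those rows beneath the originals; the stable sort on
# the duplicated index keeps each name's aliases right after it.
aliases = df['alias'].explode().rename('name').to_frame()
out = (pd.concat([df, aliases])
         .sort_index(kind='stable')
         .reset_index(drop=True))
print(out)
```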
Suppose I have the following df that I would like to reshape:
df6 = pd.DataFrame({'name': ['Sara', 'John', 'Jack'],
                    'trip places': ['UK,UK,UK', 'US,US,US', 'AUS,AUS,AUS'],
                    'Trip code': ['UK322,UK454,UK4441', 'US664,US4544,US44', 'AUS11,AUS11,AUS11']
                    })
df6
Looks like:
name trip places Trip code
0 Sara UK,UK,UK UK322,UK454,UK4441
1 John US,US,US US664,US4544,US44
2 Jack AUS,AUS,AUS AUS11,AUS11,AUS11
I want to add a new column, let's say df6['total-info'], merging the current two columns trip places and Trip code into two rows per name, so the output will look like this:
name total-info
0 Sara UK,UK,UK
UK322,UK454,UK4441
1 John US,US,US
US664,US4544,US44
2 Jack AUS,AUS,AUS
AUS11,AUS11,AUS11
I tried many methods (grouping, stack/unstack, pivot, etc.), but nothing I tried generates the output I need, and I am not completely familiar with the best function for this. I also tried concatenation, but it generated one column with the two columns' comma-separated values joined together.
Use set_index, stack, droplevel then reset_index and specify the new column name:
df7 = (
    df6
    .set_index('name')               # Preserve during reshaping
    .stack()                         # Reshape
    .droplevel(1)                    # Remove column names
    .reset_index(name='total-info')  # reset_index and name new column
)
df7:
name total-info
0 Sara UK,UK,UK
1 Sara UK322,UK454,UK4441
2 John US,US,US
3 John US664,US4544,US44
4 Jack AUS,AUS,AUS
5 Jack AUS11,AUS11,AUS11
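The first pipeline, run end-to-end on the sample frame as a self-contained check:

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Sara', 'John', 'Jack'],
                    'trip places': ['UK,UK,UK', 'US,US,US', 'AUS,AUS,AUS'],
                    'Trip code': ['UK322,UK454,UK4441', 'US664,US4544,US44', 'AUS11,AUS11,AUS11']})

# Move name into the index, stack the two remaining columns into rows,
# drop the index level holding the old column names, then reset the
# index and name the resulting column.
df7 = (
    df6
    .set_index('name')
    .stack()
    .droplevel(1)
    .reset_index(name='total-info')
)
print(df7)
```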
Or, if name is to be part of the MultiIndex, append it with set_index(append=True) and call to_frame after stack and droplevel instead:
df7 = (
    df6
    .set_index('name', append=True)  # Preserve during reshaping
    .stack()                         # Reshape
    .droplevel(2)                    # Remove column names
    .to_frame(name='total-info')     # Make DataFrame and name new column
)
total-info
name
0 Sara UK,UK,UK
Sara UK322,UK454,UK4441
1 John US,US,US
John US664,US4544,US44
2 Jack AUS,AUS,AUS
Jack AUS11,AUS11,AUS11
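A third route, not part of the answers above but worth sketching, is melt; with ignore_index=False it keeps the original row labels, so a stable sort on the index restores the per-name grouping:

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Sara', 'John', 'Jack'],
                    'trip places': ['UK,UK,UK', 'US,US,US', 'AUS,AUS,AUS'],
                    'Trip code': ['UK322,UK454,UK4441', 'US664,US4544,US44', 'AUS11,AUS11,AUS11']})

# melt stacks the two columns into rows (all 'trip places' first, then
# all 'Trip code'); keeping the original index and stable-sorting on it
# interleaves them back to two rows per name.
df7 = (
    df6.melt(id_vars='name', value_name='total-info', ignore_index=False)
       .sort_index(kind='stable')
       .drop(columns='variable')
       .reset_index(drop=True)
)
print(df7)
```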
I am trying to add one more column to my existing dataframe.
example
In[16] df1
Out[16]:
0
0 MDS31505B
But I want to add one more column at the beginning, at the 0 index position:
In[16] df1
Out[16]:
0 1
0 SKUID MDS31505B
I am trying this but it does not work:
df1.insert(loc=0, column='0', value=SKUID)
NameError: name 'SKUID' is not defined
The NameError is raised because SKUID is written without quotes, so Python looks for a variable of that name; pass it as a string instead. You can try df1.insert(0, 0, 'SKUID').
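A minimal sketch of the fix, assuming the goal is the literal string 'SKUID' in a new first column (the renumbering step is an addition here, to match the desired 0/1 column labels):

```python
import pandas as pd

df1 = pd.DataFrame({0: ['MDS31505B']})

# Insert the literal string 'SKUID' as a new first column; the earlier
# NameError came from writing SKUID without quotes.
df1.insert(loc=0, column='sku', value='SKUID')
# Renumber the columns 0..n-1 to match the desired output.
df1.columns = range(df1.shape[1])
print(df1)
```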
How can we use coalesce with multiple dataframes?
columns_List = Emp_Id, Emp_Name, Dept_Id...
I have two dataframes used in a Python script, df1[Columns_List] and df2[columns_List]. Both dataframes have the same columns, but they hold different values.
How can I use coalesce so that, let's say, if Emp_Name is null in df1[Columns_List], I pick Emp_Name from df2[Columns_list]?
I am trying to create an output CSV file.
Apologies if my framing of the question is wrong.
Please find below sample data.
For Dataframe1 -- df1[Columns_List] .. Please find below output
EmpID,Emp_Name,Dept_id,DeptName
1,,1,
2,,2,
For Dataframe2 -- df2[Columns_List] .. Please find below output
EmpID,Emp_Name,Dept_id,DeptName
1,XXXXX,1,Sciece
2,YYYYY,2,Maths
My source is a JSON file. Once I parse the data with Python, I use two dataframes in the same script. In dataframe 1 (df1) I have Emp_Name and Dept_Name as null; in that case I want to pick the data from dataframe 2 (df2).
In the example above I provided a few columns, but I may have any number of columns; the column ordering and column names will always be the same. If any column value in df1 is null, I want to pick the value from df2.
Is that possible? Please help me with any suggestion.
You can use pandas.DataFrame.combine. This method does what you need: it builds a dataframe taking elements from two dataframes according to a custom function.
You can then write a custom function which picks the element from dataframe one unless it is null, in which case the element is taken from dataframe two.
Consider the two following dataframes. I built them according to your examples, but with a small difference in Dept_id to emphasize that only null values will be replaced:
columnlist = ["EmpID", "Emp_Name", "Dept_id", "DeptName"]
df1 = pd.DataFrame([[1, None, 1, np.NaN], [2, np.NaN, 2, None]], columns=columnlist)
df2 = pd.DataFrame([[1, "XXX", 2, "Science"], [2, "YYY", 3, "Math"]], columns=columnlist)
They are:
df1
EmpID Emp_Name Dept_id DeptName
0 1 NaN 1 NaN
1 2 NaN 2 NaN
df2
EmpID Emp_Name Dept_id DeptName
0 1 XXX 2 Science
1 2 YYY 3 Math
What you need to do is:
ddf = df1.combine(df2, lambda ss, rep_ss : pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))
to get ddf:
ddf
EmpID Emp_Name Dept_id DeptName
0 1 XXX 1 Science
1 2 YYY 2 Math
As you can see, only Null values in df1 have been replaced with the corresponding values in df2.
EDIT: A bit deeper explanation
Since I've been asked in the comments, let me explain the solution a bit more:
ddf = df1.combine(df2, lambda ss, rep_ss: pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))
It is a bit compact, but there is nothing more here than some basic Python techniques, like list comprehensions, plus the use of pandas.DataFrame.combine. The pandas method is detailed in the docs I linked above: it compares the two dataframes column by column, passing the columns to a custom function which must return a pandas.Series. That Series becomes a column in the returned dataframe.
In this case, the custom function is a lambda, which uses a list comprehension to loop over the pairs of elements (one from each column) and pick one element of each pair (the first if it is not null, otherwise the second).
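For the pure "fill nulls from the other frame" case, pandas also ships DataFrame.combine_first, which coalesces null cells directly without a custom function; a sketch on the same frames:

```python
import numpy as np
import pandas as pd

columnlist = ["EmpID", "Emp_Name", "Dept_id", "DeptName"]
df1 = pd.DataFrame([[1, None, 1, np.nan], [2, np.nan, 2, None]],
                   columns=columnlist)
df2 = pd.DataFrame([[1, "XXX", 2, "Science"], [2, "YYY", 3, "Math"]],
                   columns=columnlist)

# Take each cell from df1 unless it is null, in which case fall back to
# the value at the same position in df2 (alignment is by index/columns).
ddf = df1.combine_first(df2)
print(ddf)
```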
You can use a mask to find the missing values and replace them. The best part is that you don't have to eyeball anything; the function finds what to replace for you.
You can also adjust the pd.DataFrame.select_dtypes() call to suit your needs, or go through multiple dtypes with appropriate conversion and detection logic.
import pandas as pd

ddict1 = {
    'EmpID': [1, 2],
    'Emp_Name': ['', ''],
    'Dept_id': [1, 2],
    'DeptName': ['', ''],
}
ddict2 = {
    'EmpID': [1, 2],
    'Emp_Name': ['XXXXX', 'YYYYY'],
    'Dept_id': [1, 2],
    'DeptName': ['Sciece', 'Maths'],
}
df1 = pd.DataFrame(ddict1)
df2 = pd.DataFrame(ddict2)
def replace_df_values(df_A, df_B):
    ## Select object (string) dtype columns
    for i in df_A.select_dtypes(include=['object']):
        ### Create mask for zero-length values (or null, your choice)
        mask = df_A[i] == ''
        ### Only replace if the column actually has empty values
        if mask.any():
            ### Replace on a 1-for-1 basis using .loc[]
            df_A.loc[mask, i] = df_B.loc[mask, i]

### Pass dataframes in reverse order to cover both scenarios
replace_df_values(df1, df2)
replace_df_values(df2, df1)
Initial values for df1:
EmpID Emp_Name Dept_id DeptName
0 1 1
1 2 2
Output for df1 after running function:
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Sciece
1 2 YYYYY 2 Maths
I replicated your dataframes:
# df1
EmpID Emp_Name Dept_id DeptName
0 1 1
1 2 2
# df2
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Sciece
1 2 YYYYY 2 Maths
If you want to replace missing values (NaN) from df1.column with existing values from df2.column, you could use .fillna(). For example:
df1['Emp_Name'].fillna(df2['Emp_Name'], inplace=True)
# df1
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1
1 2 YYYYY 2
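Since the question mentions an arbitrary number of columns, the same fillna idea can be looped over every shared column; a sketch assuming both frames have identical columns and row order:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'EmpID': [1, 2], 'Emp_Name': [np.nan, np.nan],
                    'Dept_id': [1, 2], 'DeptName': [np.nan, np.nan]})
df2 = pd.DataFrame({'EmpID': [1, 2], 'Emp_Name': ['XXXXX', 'YYYYY'],
                    'Dept_id': [1, 2], 'DeptName': ['Sciece', 'Maths']})

# Fill every null cell in df1 with the value at the same position in df2.
for col in df1.columns:
    df1[col] = df1[col].fillna(df2[col])
print(df1)
```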
If you want to replace all values of a given column with the values from the same column of another dataframe, you could use a list comprehension:
df1['DeptName'] = [each for each in list(df2['DeptName'])]
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Sciece
1 2 YYYYY 2 Maths
I'm sure there's a better way to do this, but I hope this helps!
From what I know, my current approach to changing values in a pandas dataframe is far from optimal and is really hurting my workflow.
Example:
I want to check whether a name from the first dataframe is found in another dataframe and, if so, put in values from that dataframe at the position (iloc) of the searched name in the first dataframe:
for idx in id_list_of_names:
    name = df["name"].iloc[idx]
    if name in df_two["name"].values:
        df["value"].iloc[idx] = df_two["value"][df_two["name"] == name].values
Dataframe 1, df:
id | name | value
1 | "David" | 0
2 | "Lisa" | 0
...............
Dataframe 2, df_two:
id | name | value
1 | "Kevin" | 10
.................
255 | "David" | 22
.................
What I want to do is put the value from df_two for David (value = 22) into dataframe 1 at David's row (df["value"].iloc[1] == 22). This should happen for every name in df that also exists in df_two.
merge() is my usual solution to this, but since the column value already exists in both frames, a new column value_1 would be created if I used merge in this case.
Why don't you merge your two dataframes on name, and then apply a custom function to create a final_value column which chooses between value and value_1?
Use merge to combine the two dataframes, keeping df as the main dataframe (a left merge). As you noted, since the column names are the same, the clashing columns will be given new names (with _x and _y suffixes).
First drop the id column from df_two and then merge:
df_two = df_two.drop('id', axis=1)
df = df.merge(df_two, on='name', how='left')
Now, create a new column value by using value_y when there is a value available, otherwise value_x:
df['value'] = df['value_y'].fillna(df['value_x'])
Last, drop the unwanted columns:
df = df.drop(['value_x', 'value_y'], axis=1)
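The steps above, put together on a tiny sample built from the question's tables (the middle rows elided by "..." in the question are omitted here):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'name': ['David', 'Lisa'],
                   'value': [0, 0]})
df_two = pd.DataFrame({'id': [1, 255],
                       'name': ['Kevin', 'David'],
                       'value': [10, 22]})

# Left-merge on name; the clashing 'value' columns get _x/_y suffixes.
merged = df.merge(df_two.drop('id', axis=1), on='name', how='left')
# Prefer df_two's value where present, otherwise keep df's original.
merged['value'] = merged['value_y'].fillna(merged['value_x'])
merged = merged.drop(['value_x', 'value_y'], axis=1)
print(merged)
```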
I have two dataframes like below -
df1_data = {'id': {0: '101', 1: '102', 2: '103', 3: '104', 4: '105'},
            'sym1': {0: 'abc', 1: 'pqr', 2: 'xyz', 3: 'mno', 4: 'lmn'},
            'a name': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}}
df1 = pd.DataFrame(df1_data)
print(df1)
df2_data = {'sym2': {0: 'abc', 1: 'xxx', 2: 'xyz'},
            'a name': {0: 'k', 1: 'e', 2: 't'}}
df2 = pd.DataFrame(df2_data)
print(df2)
I want to check sym1 available in df1 present in sym2 column of df2 or not and if present I want to extract that row's a name and add it into df1 as a new column new_col.
For that purpose I tried the snippet below, and it works on small inputs, but it does not work for my long dataframes. I get the error and warning below -
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
code snippet -
df1['root'] = df2[df1['sym1'].isin(df2.sym2)]['a name']
print(df1)
How can I grab the a name column from df2 and make new_col in df1 for the matching rows?
What you describe is a typical merge operation. In your particular case, you have two different dataframes sharing an identifier column (sym1 in one, sym2 in the other) that aligns the corresponding rows (or identities) that belong together. All you need to do is merge on those identifier columns:
>>> to_merge = df2.rename(columns={"a name": "new_col"}) # rename to desired column name
>>> df_merged = pd.merge(df1, to_merge, how="left", left_on="sym1", right_on="sym2")
>>> print(df_merged)
a name id sym1 new_col sym2
0 a 101 abc k abc
1 b 102 pqr NaN NaN
2 c 103 xyz t xyz
3 d 104 mno NaN NaN
4 e 105 lmn NaN NaN
See the pandas merge documentation for more information.
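For a single looked-up column, an alternative to the merge (a sketch, not part of the answer above) is to build a lookup Series from df2 and map it over sym1; unmatched symbols come back as NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['101', '102', '103', '104', '105'],
                    'sym1': ['abc', 'pqr', 'xyz', 'mno', 'lmn'],
                    'a name': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'sym2': ['abc', 'xxx', 'xyz'],
                    'a name': ['k', 'e', 't']})

# Build a sym2 -> 'a name' lookup from df2 and map it over sym1.
lookup = df2.set_index('sym2')['a name']
df1['new_col'] = df1['sym1'].map(lookup)
print(df1)
```

Unlike the merge, this leaves df1's shape and column order untouched and adds only the one new column.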