Python merge insert duplicate column values instead of creating _x & _y columns - python

Hi I have 2 data sets:
Data A:
Column A  Column B  Column C
Hello     NaN       John
Bye       NaN       Mike
Data B:
Column A  Column B
Hello     123
Raw data:
a = pd.DataFrame([['Hello', np.nan,'John'],['Bye',np.nan,'Mike']], columns=['Column A','Column B','Column C'])
b = pd.DataFrame([['Hello', 123]], columns=['Column A','Column B'])
I want to merge Data A and Data B using a left join (Data A is the main data, and values should only be brought in where Column A matches in Data B), filling Data A's Column B with the numeric values from Data B's Column B.
The column names match, but my script below results in two Column B's.
df=a.merge(b, on ='Column A', how='left')
df:
Column A  Column B_x  Column C  Column B_y
Hello     NaN         John      123
Bye       NaN         Mike
I want the following result:
Column A  Column B  Column C
Hello     123       John
Bye       NaN       Mike
Please note I need to effectively insert Column B's data correlating to Column A, not just push Data B into Data A in exact row order. I need the code to find the match for Column A regardless of which row it's located in and insert them appropriately.

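You don't need a merge for this, since a merge brings the columns of the two dataframes together. Because your dataframes share the same structure, you can use either fillna or update. Both align on the index and column labels, not on the values in Column A, so they give the desired result here only because the 'Hello' row sits at index 0 in both dataframes:
a.fillna(b, inplace=True)  # fills NaN in a from b; not in place unless you specify inplace=True
a.update(b)  # alternative: modifies a in place using the non-NA values from b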
print(a)
  Column A  Column B Column C
0    Hello     123.0     John
1      Bye       NaN     Mike
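Since the question asks for matching on Column A regardless of row position, a key-based fill may be closer to what is wanted. The following is a minimal sketch, assuming each Column A value appears at most once in Data B; the names lookup and result are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([['Hello', np.nan, 'John'], ['Bye', np.nan, 'Mike']],
                 columns=['Column A', 'Column B', 'Column C'])
b = pd.DataFrame([['Hello', 123]], columns=['Column A', 'Column B'])

# Build a lookup Series keyed by Column A (assumption: keys in b are unique;
# drop_duplicates keeps the first occurrence if they are not).
lookup = b.drop_duplicates('Column A').set_index('Column A')['Column B']

# Map each key in a to its value in b, and use it only where a has NaN.
result = a.copy()
result['Column B'] = result['Column B'].fillna(result['Column A'].map(lookup))

print(result)
```

Because the match goes through the values of Column A rather than the row index, the row order of Data B does not matter.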

Related

Is there a way to put together or merge lists of dataframes to create a dataframe based on their index?

DF1 = [
Column A  Column B
Cell 1    Cell 2
Cell 3    Cell 4
Column A  Column B
Cell 1    Cell 2
Cell 3    Cell 4
]
DF2 = [ NY, FL ]
In this case DF1 and DF2 each have two elements (indexes 0 and 1).
The result I am looking for is the following:
Main_DF = [
Column A  Column B  Column C
Cell 1    Cell 2    NY
Cell 3    Cell 4    NY
Column A  Column B  Column C
Cell 1    Cell 2    FL
Cell 3    Cell 4    FL
]
I tried to use pd.concat, assign and insert, but none of them gives me the result I'm looking for.
Lists hold references to dataframes, so you can amend the dataframes in place without amending the list at all.
So, I'd do something like...
for df, val in zip(DF1, DF2):
    df['Column C'] = val
Using zip allows you to iterate through the two lists in sync with each other:
1st element of DF1 goes into df, and 1st element of DF2 goes into val
2nd element of DF1 goes into df, and 2nd element of DF2 goes into val
and so on
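A self-contained sketch of the zip approach, assuming DF1 is a list of two dataframes shaped like the ones in the question and DF2 is a list of strings (the question does not show how either was constructed):

```python
import pandas as pd

# Assumed reconstruction of the question's inputs.
DF1 = [
    pd.DataFrame({'Column A': ['Cell 1', 'Cell 3'], 'Column B': ['Cell 2', 'Cell 4']}),
    pd.DataFrame({'Column A': ['Cell 1', 'Cell 3'], 'Column B': ['Cell 2', 'Cell 4']}),
]
DF2 = ['NY', 'FL']

# Pair each dataframe with the value at the same position and
# broadcast that value down a new column, modifying the frames in place.
for df, val in zip(DF1, DF2):
    df['Column C'] = val

for df in DF1:
    print(df)
```

Note that zip stops at the shorter of the two lists, so a length mismatch between DF1 and DF2 would silently leave some dataframes without the new column.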

Extract a number in one column after - using pandas

I want to extract whatever comes after the "-".
My data in column A looks like:
Column A  Column B
001-3     5
002-14    6
What I want is:
Column A  Column B  Column C
001-3     3         5
002-14    14        6
Is there any function in Python like scan in SAS, so I can extract what comes after the "-" in column A and place it in column B, after moving B to C?
# reassign Column B to Column C
df['Column C'] = df['Column B']
#split from right and limit to only single split, then take the right value to create column B
df['Column B']=df['Column A'].str.rsplit('-',n=1,expand=True)[1]
df
  Column A Column B  Column C
0    001-3        3         5
1   002-14       14         6
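The rsplit approach leaves the extracted values as strings. As an alternative sketch (not the answer's method), str.extract with a regular expression can pull out the trailing digits, and pd.to_numeric converts them to numbers. This assumes the part after the last "-" is always made of digits; the sample frame below is reconstructed from the question.

```python
import pandas as pd

df = pd.DataFrame({'Column A': ['001-3', '002-14'], 'Column B': [5, 6]})

# Move the existing Column B to Column C first.
df['Column C'] = df['Column B']

# Capture the digits that follow the last hyphen; rows that do not
# match the pattern become NaN instead of raising an error.
after_dash = df['Column A'].str.extract(r'-(\d+)$', expand=False)
df['Column B'] = pd.to_numeric(after_dash)

print(df)
```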

categorize string from one column in another column - python

I have a dataframe with 3 columns. Column A contains the titles of a lot of products, Column B contains all brand names and Column C contains the models/series of all products. Column A has over 2000 rows, Column B has about 50 rows and Column C has about 200 rows. I want to create a new Column D that categorizes whether the title in Column A includes a brand, a model, or is generic.
Example of my dataframe and the desired result in Column D:
Column A       Column B  Column C    Column D
Running shoes  Nike      Airmax 2    Generic
Nike airmax 2  Adidas    All stars   Model/series
Airmax 2       Converse  Ultraboost  Model/series
Nike Shoes     Puma      Questar     Brand
If a row in Column A contains both a brand and a model, I want Column D to categorize the row as Model/series. All rows in Column A that cannot be matched with a brand or a model/series should be categorized as Generic.
I began trying with this:
df['Column D'] = df.apply(lambda x: x.Column_b in x.Column_a, axis=1)
Here I got an error because Column B has far fewer rows than Column A.
Then I wondered whether looping is even the right way to do it, or whether I need a regex.
Any help on how to get the desired Column D would be highly appreciated.
Use Series.str.contains to create a boolean mask m1 whose True values mark the rows where Column A contains any of the values from Column B, and create the mask m2 for Column C in the same way. Then use np.select to pick the label for each row from the choices according to the conditions built from m1 and m2:
m1 = df['Column A'].str.contains('|'.join(df['Column B']), case=False)
m2 = df['Column A'].str.contains('|'.join(df['Column C']), case=False)
df['Column D'] = np.select(
    [m1 & m2, m1, m2], ['Model/series', 'Brand', 'Model/series'], 'Generic')
# print(df)
   Column A       Column B  Column C    Column D
0  Running shoes  Nike      Airmax 2    Generic
1  Nike airmax 2  Adidas    All stars   Model/series
2  Airmax 2       Converse  Ultraboost  Model/series
3  Nike Shoes     Puma      Questar     Brand
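The question mentions that Column B and Column C have fewer entries than Column A, which in a single dataframe would leave NaN padding, and brand or model names may contain characters that are special in a regex. A hedged variant of the mask approach that guards against both is sketched below; the helper any_of is a hypothetical name, and the sample frame is reconstructed from the question. It assumes each of Column B and Column C holds at least one non-missing value.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': ['Running shoes', 'Nike airmax 2', 'Airmax 2', 'Nike Shoes'],
    'Column B': ['Nike', 'Adidas', 'Converse', 'Puma'],
    'Column C': ['Airmax 2', 'All stars', 'Ultraboost', 'Questar'],
})


def any_of(values):
    """Build a regex alternation that matches any of the given literal strings."""
    return '|'.join(re.escape(v) for v in values.dropna().astype(str))


# Case-insensitive substring tests; missing titles count as "no match".
m1 = df['Column A'].str.contains(any_of(df['Column B']), case=False, na=False)
m2 = df['Column A'].str.contains(any_of(df['Column C']), case=False, na=False)

# Conditions are checked in order, so a model match wins over a brand match.
df['Column D'] = np.select(
    [m2.to_numpy(dtype=bool), m1.to_numpy(dtype=bool)],
    ['Model/series', 'Brand'],
    default='Generic',
)

print(df)
```

Listing the model condition first gives the same labels as the three-condition version, because np.select uses the first condition that is true for each row.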
Maybe something like:
df['D'] = ['Brand' if x in df['B'].values else 'Model/Series' if x in df['C'].values else 'Generic' for x in df['A']]
I'm not 100% sure whether a Column A value can contain both a Column B and a Column C value, but if so it's trivial to add another condition inside the list comprehension to catch both. Note that x in df['B'].values tests for an exact match of the whole value, not for a substring.

How to compare one column value available or not in another column dataframe and extract another column of second dataframe if present

I have two dataframes like below -
df1_data = {'id' :{0:'101',1:'102',2:'103',3:'104',4:'105'},
'sym1' :{0:'abc',1:'pqr',2:'xyz',3:'mno',4:'lmn'},
'a name' :{0:'a',1:'b',2:'c',3:'d',4:'e'}}
df1 = pd.DataFrame(df1_data)
print(df1)
df2_data = {'sym2' :{0:'abc',1:'xxx',2:'xyz'},
'a name' :{0:'k',1:'e',2:'t'}}
df2 = pd.DataFrame(df2_data)
print(df2)
I want to check whether each sym1 value in df1 is present in the sym2 column of df2 and, if it is, extract that row's a name and add it to df1 as a new column new_col.
For that purpose I tried the snippet below. It works, but not for my long dataframes, where I get the error and warning messages below:
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Code snippet:
df1['root'] = df2[df1['sym1'].isin(df2.sym2)]['a name']
print(df1)
How can I grab the a name values from df2 and put them in new_col in df1 for the matching rows?
What you describe is a typical merge operation. In your particular case, you have two different data frames sharing an identifier column (sym1 and sym2) which align corresponding rows (or identities) that belong together. All you need to do is a merge on those identifier columns:
>>> to_merge = df2.rename(columns={"a name": "new_col"}) # rename to desired column name
>>> df_merged = pd.merge(df1, to_merge, how="left", left_on="sym1", right_on="sym2")
>>> print(df_merged)
a name id sym1 new_col sym2
0 a 101 abc k abc
1 b 102 pqr NaN NaN
2 c 103 xyz t xyz
3 d 104 mno NaN NaN
4 e 105 lmn NaN NaN
See the pandas merge documentation for more information.
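A self-contained sketch of the same merge, with two optional additions that are not in the answer above: validate='many_to_one' makes pandas raise if sym2 contains duplicate keys (which would otherwise duplicate rows of df1), and the redundant sym2 column is dropped afterwards. The data is the question's, written as plain lists.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['101', '102', '103', '104', '105'],
    'sym1': ['abc', 'pqr', 'xyz', 'mno', 'lmn'],
    'a name': ['a', 'b', 'c', 'd', 'e'],
})
df2 = pd.DataFrame({
    'sym2': ['abc', 'xxx', 'xyz'],
    'a name': ['k', 'e', 't'],
})

# Rename first so the incoming column does not clash with df1's "a name".
to_merge = df2.rename(columns={'a name': 'new_col'})

df_merged = (
    df1.merge(to_merge, how='left', left_on='sym1', right_on='sym2',
              validate='many_to_one')  # raises if sym2 has duplicate keys
       .drop(columns='sym2')           # the key is already present as sym1
)

print(df_merged)
```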

Python: create new row based on column names in DataFrame

I would like to know how to make a new row based on the column names row in a python dataframe, and append it to the same dataframe.
example
df = pd.DataFrame(np.random.randn(10, 5),columns=['abx', 'bbx', 'cbx', 'acx', 'bcx'])
I want to create a new row based on the column names that gives:
b | b | b | c | c | by taking the middle character of each column name.
The idea is to use that new row later for multi-indexing the columns.
I'm assuming this is what you want, as you've not responded. We can append a new row by creating a dict from zipping the df columns with a list comprehension of the middle character (assuming that the column names have length 3). DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
In [126]:
pd.concat([df, pd.DataFrame([dict(zip(df.columns, [col[1] for col in df]))])], ignore_index=True)
Out[126]:
abx bbx cbx acx bcx
0 -0.373421 -0.1005462 -0.8280985 -0.1593167 1.335307
1 1.324328 -0.6189612 -0.743703 0.9419248 1.282682
2 0.3730312 -0.06697892 1.113707 -0.9691056 1.779643
3 -0.6644958 1.379606 -0.3751724 -1.135034 0.3287292
4 0.4406139 -0.5767996 -0.2267589 -1.384412 -0.03038372
5 -1.242734 -0.838923 -0.6724592 1.405247 -0.3716862
6 -1.682637 -1.69309 -1.291833 1.781704 0.6321988
7 -0.5793783 -0.6809975 1.03502 -0.6498381 -1.124236
8 1.589016 1.272961 -1.968225 0.5515182 0.3058628
9 -2.275342 2.892237 2.076253 -0.1422845 -0.09776171
10 b b b c c
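Since the stated goal is to multi-index the columns later, the middle characters can also go straight into a column MultiIndex without storing them as a data row first, which keeps the numeric columns numeric. A minimal sketch, assuming every column name has at least two characters; the level names group and name are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5),
                  columns=['abx', 'bbx', 'cbx', 'acx', 'bcx'])

# Second character of each column name becomes the outer level.
middle = [col[1] for col in df.columns]

df.columns = pd.MultiIndex.from_arrays([middle, df.columns],
                                       names=['group', 'name'])

# Selecting an outer-level key returns all columns in that group.
print(df['c'].head())
```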
iloc lets you read an entire row by position; you just say whichever row you want (the older ix indexer was removed in pandas 1.0).
Then you take that row's values and assign them as the column names.
See the example below.
virData = pd.DataFrame(df)
virData.columns = virData.iloc[1].values
virData.columns
