Splitting a column into multiple rows - python

I have this data in a dataframe. The Code column holds several comma-separated values and is of object dtype. I want to split the rows so that each value in Code ends up on its own row, keeping the other columns as they are.
I tried to change the datatype by using
df['Code'] = df['Code'].astype(str)
and then tried to split on the commas and reset the index based on the unique ID, but I only get two columns back. I need the entire dataset.
df = (pd.DataFrame(df.Code.str.split(',').tolist(), index=df.ID).stack()).reset_index([0, 'ID'])
df.columns = ['ID', 'Code']
Can someone help me out? I don't understand how to adapt this code.
Attaching the setup code:
import pandas as pd
x = {'ID': ['1','2','3','4','5','6','7'],
     'A': ['a','b','c','a','b','b','c'],
     'B': ['z','x','y','x','y','z','x'],
     'C': ['s','d','w','','s','s','s'],
     'D': ['m','j','j','h','m','h','h'],
     'Code': ['AB,BC,A','AD,KL','AD,KL','AB,BC','A','A','B']
     }
df = pd.DataFrame(x, columns = ['ID', 'A','B','C','D','Code'])
df

You can first split the Code column on the comma and then explode it to get the desired output.
df['Code'] = df['Code'].str.split(',')
df = df.explode('Code')
OUTPUT:
  ID  A  B  C  D Code
0  1  a  z  s  m   AB
0  1  a  z  s  m   BC
0  1  a  z  s  m    A
1  2  b  x  d  j   AD
1  2  b  x  d  j   KL
2  3  c  y  w  j   AD
2  3  c  y  w  j   KL
3  4  a  x     h   AB
3  4  a  x     h   BC
4  5  b  y  s  m    A
5  6  b  z  s  h    A
6  7  c  x  s  h    B
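Starting from the original df, the two steps can also be chained into a single statement. A small sketch that produces the same frame (explode requires pandas 0.25 or newer):
df = df.assign(Code=df['Code'].str.split(',')).explode('Code')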
If needed, you can replace the empty string with NaN.
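For example, a minimal sketch assuming the blanks are plain empty strings, as in column C of the setup data:
import numpy as np

df = df.replace('', np.nan)  # or target a single column, e.g. df['C'].replace('', np.nan)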

Related

python-CSV Multiple Columns with the same header into one column

I have a CSV file of company data with 22 rows and 6500 columns. The columns repeat the same names, and I need to stack the columns with the same name into single columns according to their headers.
I now have the data in one df like this:
Y  C  Y  C  Y  C
1  a  1  b  1  c
2  a  2  b  2  c
and I need to get it like this:
Y  C
1  a
2  a
1  b
2  b
1  c
2  c
I would try an approach where you slice the df into chunks by iteration and concat them back together, since the column names can't be distinguished from one another. (This assumes the header names are literally duplicated, so every chunk carries the same column labels and concat stacks them vertically instead of creating new columns.)
EDIT
Changed answer to new input:
chunksize = 2
df = (
    pd.concat(
        [df.iloc[:, i:i + chunksize] for i in range(0, len(df.columns), chunksize)]
    )
    .reset_index(drop=True)
)
print(df)
Y C
0 1 a
1 2 a
2 1 b
3 2 b
4 1 c
5 2 c
I couldn't resist looking for a solution.
The best I found so far accounts for the fact that pd.read_csv addresses repeated column names by appending '.N' to the duplicates.
In [2]: df = pd.read_csv('duplicate_columns.csv')
In [3]: df
Out[3]:
1 2 3 4 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2
0 a q j e w e r t y u d s
1 b w w f c e f g d c s a
2 d q e h c f b f a w q r
To put your data into the same column...
Group the columns by their original names.
Apply a flattener to convert to a series of arrays.
Create a new data frame from the series viewed as a dict.
In [3]: grouper = lambda l: l.split('.')[0] # peels off added suffix
In [4]: flattener = lambda v: v.stack().values # reshape groups
In [4]: pd.DataFrame(df.groupby(by=grouper, axis='columns')
...: .apply(flattener)
...: .to_dict())
Out[4]:
1 2 3 4
0 a q j e
1 w e r t
2 y u d s
3 b w w f
4 c e f g
5 d c s a
6 d q e h
7 c f b f
8 a w q r
I'd love to see a cleaner, less obtuse, general solution.
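Not necessarily cleaner, but here is one more sketch: strip the '.N' suffixes that read_csv added, rename each contiguous group back to the base names, and concatenate. It assumes the repeated groups are equal-width contiguous blocks, and it reuses the hypothetical file name from above:
import pandas as pd

df = pd.read_csv('duplicate_columns.csv')

base = df.columns.str.replace(r'\.\d+$', '', regex=True)  # 'Y.1' -> 'Y'
names = base.unique()                                      # one set of headers, in order
width = len(names)                                         # columns per repeated group
reps = len(df.columns) // width                            # number of repeated groups

stacked = pd.concat(
    [df.iloc[:, i * width:(i + 1) * width].set_axis(names, axis=1) for i in range(reps)],
    ignore_index=True,
)
print(stacked)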

How to insert a pandas dataframe having a single csv column into MySQL Database

I have a pandas dataframe that I read from a Google Sheet.
I then added the tag column using:
import numpy as np

df['tag'] = df.filter(like='Subject', axis=1).apply(lambda x: np.where(x == 'Y', x.name, '')).values.tolist()
df['tag'] = df['tag'].apply(lambda x: [i for i in x if i != ''])
Resultant sample DataFrame:
   Id Name Subject-A Subject-B  Total                     tag
0   1    A         Y       NaN    100             [Subject-A]
1   2    B       NaN         Y     98             [Subject-B]
2   3    C         Y         Y    191  [Subject-A, Subject-B]
3   4    D       NaN         Y    100             [Subject-B]
4   5    E       NaN         Y     95             [Subject-B]
Then I export the dataframe to a MySQL database after converting the tag column into a comma-separated string:
df['tag'] = df['tag'].map(lambda x : ', '.join(str(i) for i in x)).str.replace('Subject-','')
df
   Id Name Subject-A Subject-B  Total   tag
0   1    A         Y       NaN    100     A
1   2    B       NaN         Y     98     B
2   3    C         Y         Y    191  A, B
3   4    D       NaN         Y    100     B
4   5    E       NaN         Y     95     B
df.to_sql(name = 'table_name', con = conn, if_exists = 'replace', index = False)
But in the MySQL database the tag column is:
A,
,B
A,B
,B
,B
My actual data has many such "Subject" columns so the result looks like:
, , , D
A, ,C,
...
...
Could someone please let me know why it gives the expected output in pandas, but the column looks different when I save the dataframe to Cloud SQL? The expected output in the MySQL database is the same as how the tag column appears in pandas.
Here is an alternative solution; this looks like a data-related problem.
First filter the Subject columns and remove the Subject- prefix, then use DataFrame.dot with the column names plus a separator, and finally strip the separator from the right side:
df1 = df.filter(like = 'Subject').rename(columns=lambda x: x.replace('Subject-',''))
print (df1)
A B
0 Y NaN
1 NaN Y
2 Y Y
3 NaN Y
4 NaN Y
df['tag'] = df1.eq('Y').dot(df1.columns + ', ').str.rstrip(', ')
print (df)
Id Name Subject-A Subject-B Total tag
0 1 A Y NaN 100 A
1 2 B NaN Y 98 B
2 3 C Y Y 191 A, B
3 4 D NaN Y 100 B
4 5 E NaN Y 95 B
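With the tag column rebuilt this way, the export itself can stay exactly as in the question (conn being the existing connection or engine), and should write the same strings you see in pandas:
df.to_sql(name='table_name', con=conn, if_exists='replace', index=False)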

Subtracting multiple columns between dataframes based on key

I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to deduct columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, what the answer does not explain is what to do if there are other columns in the primary df (such as column B) that should not be used as an index or involved in the operation.
Could somebody please advise?
I was originally using a loop that finds the value in the other df and deducts it, but this takes too long to run with the size of data I am working with.
The idea is to specify the column(s) used for matching and the column(s) to subtract, move all other column names into a MultiIndex together with the match column(s), and then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or fill the values that drop out of the subtraction back from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
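If a merge is easier to reason about, here is an equivalent sketch reusing the same match and cols variables (the _2 suffixes are illustrative only):
m = Df1.merge(Df2[match + cols], on=match, suffixes=('', '_2'))
m[cols] = m[cols].to_numpy() - m[[c + '_2' for c in cols]].to_numpy()
df = m.drop(columns=[c + '_2' for c in cols])
print(df)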

How to fill a column based on several other columns?

I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll')
    }
)
df2 = pd.DataFrame(
    {
        'mapcol': list('abpppozl')
    }
)
A B C
0 a s z
1 a m z
2 a n z
3 b i q
4 d p q
5 c i w
6 d u l
7 e y l
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
Now I want to create an additional column in df1 which should be filled with values coming from the columns A, B and C respectively, depending on whether their values can be found in df2['mapcol']. If the values in one row can be found in more than one column, they should be first used from A, then B and then C, so my expected outcome looks like this:
A B C final
0 a s z a # <- values can be found in A and C, but A is preferred
1 a m z a # <- values can be found in A and C, but A is preferred
2 a n z a # <- values can be found in A and C, but A is preferred
3 b i q b # <- value can be found in A
4 d p q p # <- value can be found in B
5 c i w NaN # none of the values can be mapped
6 d u l l # value can be found in C
7 e y l l # value can be found in C
A straightforward implementation could look like this (filling the column final iteratively using fillna in the preferred order):
preferred_order = ['A', 'B', 'C']
df1['final'] = np.nan
for col in preferred_order:
    df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])
which gives the desired outcome.
Does anyone see a solution that avoids the loop?
You can use where and isin on the full dataframe df1 to mask the values not present in df2['mapcol'], then reorder with preferred_order, bfill along the columns, and keep the first column with iloc:
preferred_order = ['A', 'B', 'C']
df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
                [preferred_order]
                .bfill(axis=1)
                .iloc[:, 0]
                )
print (df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
Use:
order = ['A', 'B', 'C'] # order of columns
d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)
Details:
Use DataFrame.isin and filter the rows using boolean masking with DataFrame.any along axis=1, then use DataFrame.idxmax along axis=1 to get, for each row, the name of the first column that contains a match.
print(d)
0 A
1 A
2 A
3 A
4 B
6 C
7 C
dtype: object
Use DataFrame.lookup to look up the values in df1 corresponding to the index and columns of d, and assign these values to the column final:
print(df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
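Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on newer versions, a roughly equivalent sketch (reusing d from above and following the factorize-based replacement suggested in the pandas docs) could be:
import numpy as np

idx, cols = pd.factorize(d)
vals = df1.reindex(index=d.index, columns=cols).to_numpy()[np.arange(len(d)), idx]
df1.loc[d.index, 'final'] = vals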

How to fill NaNs "ignoring" the index?

I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    {
        'A': list('abdcde'),
        'B': ['s', np.nan, 'h', 'j', np.nan, 'g']
    }
)
df2 = pd.DataFrame(
    {
        'mapcol': list('abpppozl')
    }
)
A B
0 a s
1 b NaN
2 d h
3 c j
4 d NaN
5 e g
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
I would now like to fill B in df1 using the values of df2['mapcol'], not based on the actual index but, in this case, using just the first two entries of df2['mapcol']. So instead of b and p, which correspond to index 1 and 4 respectively, I would like to use the values a and b.
One way of doing it would be to construct a dictionary with the correct indices and values:
df1['B_filled_incorrect'] = df1['B'].fillna(df2['mapcol'])
ind = df1[df1['B'].isna()].index
# reset_index is required as we might have a non-numerical index
val = df2.reset_index().loc[:len(ind) - 1, 'mapcol'].values
map_dict = dict(zip(ind, val))
df1['B_filled_correct'] = df1['B'].fillna(map_dict)
A B B_filled_incorrect B_filled_correct
0 a s s s
1 b NaN b a
2 d h h h
3 c j j j
4 d NaN p b
5 e g g g
which gives the desired output.
Is there a more straightforward way that avoids the creation of all these intermediate variables?
For a positional fill, you can assign the values via loc and convert the fill values to a list:
df1.loc[df1.B.isna(), 'B'] = df2.mapcol.iloc[:df1.B.isna().sum()].tolist()
df1
Out[232]:
A B
0 a s
1 b a
2 d h
3 c j
4 d b
5 e g
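The same positional fill can be spelled out with an explicit mask, which may be a bit easier to read (a sketch equivalent to the one-liner above):
mask = df1['B'].isna()
df1.loc[mask, 'B'] = df2['mapcol'].to_numpy()[:mask.sum()]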
