I have a CSV file of company data with 22 rows and 6500 columns. Many columns share the same name, and I need to stack the columns with the same name into a single column per header.
I now have the data in one df like this:
Y C Y C Y C
1. a. 1. b. 1. c.
2. a. 2. b. 2. c.
and I need to get it like this:
Y C
1. a.
2. a.
1. b.
2. b.
1. c.
2. c.
I would try an approach where you slice the df into chunks of columns in a loop and concat them back together, since the duplicated column names can't be used to tell the columns apart.
EDIT
Changed answer to new input:
chunksize = 2
df = (
    pd.concat(
        [
            df.iloc[:, i:i+chunksize] for i in range(0, len(df.columns), chunksize)
        ]
    )
    .reset_index(drop=True))
print(df)
Y C
0 1 a
1 2 a
2 1 b
3 2 b
4 1 c
5 2 c
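For reference, the same stacking can be sketched with a NumPy reshape instead of a list of slices. This is a variant, not the code above, and it assumes every block has the same width and that the blocks sit side by side in a fixed order. The sample frame and the names values, n_rows, stacked and result are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: three side-by-side (Y, C) blocks with repeated headers
df = pd.DataFrame(
    [[1, "a", 1, "b", 1, "c"],
     [2, "a", 2, "b", 2, "c"]],
    columns=["Y", "C"] * 3,
)

chunksize = 2
values = df.to_numpy()  # object array, shape (n_rows, n_columns)
n_rows = len(df)

# (rows, blocks, chunksize) -> (blocks, rows, chunksize) -> stacked 2-D array
stacked = (
    values.reshape(n_rows, -1, chunksize)
    .transpose(1, 0, 2)
    .reshape(-1, chunksize)
)

result = pd.DataFrame(stacked, columns=df.columns[:chunksize])
print(result)
```

Because the intermediate array has object dtype, the resulting columns are object columns too; calling result.infer_objects() afterwards should recover numeric dtypes where possible.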
I couldn't resist looking for a solution.
The best I found so far accounts for the fact that pd.read_csv addresses repeated column names by appending '.N' to the duplicates.
In [2]: df = pd.read_csv('duplicate_columns.csv')
In [3]: df
Out[3]:
1 2 3 4 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2
0 a q j e w e r t y u d s
1 b w w f c e f g d c s a
2 d q e h c f b f a w q r
To put your data into the same column...
Group the columns by their original names.
Apply a flattener to convert to a series of arrays.
Create a new data frame from the series viewed as a dict.
In [3]: grouper = lambda l: l.split('.')[0] # peels off added suffix
In [4]: flattener = lambda v: v.stack().values # reshape groups
In [4]: pd.DataFrame(df.groupby(by=grouper, axis='columns')
...: .apply(flattener)
...: .to_dict())
Out[4]:
1 2 3 4
0 a q j e
1 w e r t
2 y u d s
3 b w w f
4 c e f g
5 d c s a
6 d q e h
7 c f b f
8 a w q r
I'd love to see a cleaner, less obtuse, general solution.
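One possible alternative, sketched under a few assumptions: as far as I know, grouping along the column axis (axis='columns') is deprecated in recent pandas versions (2.1 and later), so the version below avoids groupby altogether. It assumes the original header names contain no '.' themselves, and that every name is repeated the same number of times. The CSV text and the names base and result are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV with repeated headers; read_csv renames the repeats to '1.1', '2.1', ...
csv_text = "1,2,1,2\na,q,w,e\nb,w,c,e\n"
df = pd.read_csv(io.StringIO(csv_text))

# Strip the '.N' suffix to recover each column's original name
base = df.columns.str.split(".").str[0]

# For each original name, select its columns and flatten them row by row
result = pd.DataFrame(
    {name: df.loc[:, base == name].to_numpy().ravel() for name in base.unique()}
)
print(result)
```

The flattening order is row by row within each group, which matches the order produced by stack().values in the answer above.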
Related
I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to deduct columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, the answer does not explain what to do when the primary df has other columns (such as column B) that should be neither part of the index nor involved in the operation.
Is somebody please able to advise?
I was originally using a loop that finds the value in the other df and deducts it, but this takes too long to run with the size of data I am working with.
The idea is to specify the column(s) used for matching and the column(s) to subtract, move every column that is not in cols into a MultiIndex, and then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or fill the values that were not matched with the originals from Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
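A further variant, as a minimal sketch: align Df2 to the row order of Df1 with reindex and subtract the raw values, so that column B is never touched. This assumes the keys in Df2 are unique; keys missing from Df2 are treated as subtracting 0, which is my assumption and not something the question specifies. The names aligned and Df3 are illustrative.

```python
import pandas as pd

Df1 = pd.DataFrame({"A": list("xyz"), "B": list("jkl"), "C": [5, 7, 9], "D": [2, 3, 4]})
Df2 = pd.DataFrame({"A": list("zxy"), "B": list("opq"), "C": [1, 2, 3], "D": [1, 1, 1]})

match = "A"
cols = ["C", "D"]

# Line up Df2's values with Df1's row order; keys missing from Df2 subtract 0
aligned = Df2.set_index(match)[cols].reindex(Df1[match]).fillna(0)

# Subtract position by position and leave every other column as it is
Df3 = Df1.copy()
Df3[cols] = Df1[cols].to_numpy() - aligned.to_numpy()
print(Df3)
```

Since the subtraction works on plain arrays, the result keeps the row order and index of Df1 regardless of how Df2 is sorted.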
I have this data in a dataframe. The Code column holds several comma-separated values per row and is of object datatype.
I want to split the rows so that each code ends up in its own row.
I tried to change the datatype by using
df['Code'] = df['Code'].astype(str)
and then tried to split on the commas and reset the index on the basis of ID (unique), but I only get two column values. I need the entire dataset.
df = (pd.DataFrame(df.Code.str.split(',').tolist(), index=df.ID).stack()).reset_index([0, 'ID'])
df.columns = ['ID', 'Code']
Can someone help me out? I don't understand how to twist this code.
Attaching the setup code:
import pandas as pd
x = {'ID': ['1','2','3','4','5','6','7'],
     'A': ['a','b','c','a','b','b','c'],
     'B': ['z','x','y','x','y','z','x'],
     'C': ['s','d','w','','s','s','s'],
     'D': ['m','j','j','h','m','h','h'],
     'Code': ['AB,BC,A','AD,KL','AD,KL','AB,BC','A','A','B']
     }
df = pd.DataFrame(x, columns = ['ID', 'A','B','C','D','Code'])
df
You can first split the Code column on commas, then explode it to get the desired output.
df['Code'] = df['Code'].str.split(',')
df = df.explode('Code')
OUTPUT:
ID A B C D Code
0 1 a z s m AB
0 1 a z s m BC
0 1 a z s m A
1 2 b x d j AD
1 2 b x d j KL
2 3 c y w j AD
2 3 c y w j KL
3 4 a x h AB
3 4 a x h BC
4 5 b y s m A
5 6 b z s h A
6 7 c x s h B
If needed, you can replace the empty strings with NaN.
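A minimal sketch of that last step, on a made-up two-row sample (the data below is illustrative, not the full setup from the question). It also resets the index, since explode repeats the original index labels:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample; column C contains one empty string
df = pd.DataFrame({"ID": ["3", "4"], "C": ["w", ""], "Code": ["AD,KL", "AB,BC"]})

# Split on commas, then give every code its own row
df["Code"] = df["Code"].str.split(",")
df = df.explode("Code").reset_index(drop=True)

# Turn empty strings in column C into NaN
df["C"] = df["C"].replace("", np.nan)
print(df)
```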
So here is my problem.
I'm using pandas to parse a csv file.
My csv file looks like this:
A B C D
1 x 5 e
2 y 6 f
3 z 7 g
What I want to get is :
get all the values of column C
Place them under column A
Same with columns D and B
So it would get me this :
A B C D
1 x
2 y
3 z
5 e
6 f
7 g
However, all I've been able to do is create new columns that "sum" column A with column C and column B with column D:
A B C D E F
1 x 5 e 15 xe
2 y 6 f 26 yf
3 z 7 g 37 zg
Any idea would be appreciated.
Thanks
Rename columns C and D to A and B and append them to the bottom of columns A and B:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
result = pd.concat([df[['A', 'B']], df[['C', 'D']].set_axis(['A', 'B'], axis=1)], ignore_index=True)
I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
'A': list('aaabdcde'),
'B': list('smnipiuy'),
'C': list('zzzqqwll')
}
)
df2 = pd.DataFrame(
{
'mapcol': list('abpppozl')
}
)
A B C
0 a s z
1 a m z
2 a n z
3 b i q
4 d p q
5 c i w
6 d u l
7 e y l
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
Now I want to create an additional column in df1 which should be filled with values coming from the columns A, B and C respectively, depending on whether their values can be found in df2['mapcol']. If the values in one row can be found in more than one column, they should be first used from A, then B and then C, so my expected outcome looks like this:
A B C final
0 a s z a # <- values can be found in A and C, but A is preferred
1 a m z a # <- values can be found in A and C, but A is preferred
2 a n z a # <- values can be found in A and C, but A is preferred
3 b i q b # <- value can be found in A
4 d p q p # <- value can be found in B
5 c i w NaN # none of the values can be mapped
6 d u l l # value can be found in C
7 e y l l # value can be found in C
A straightforward implementation could look like this (filling the column final iteratively using fillna in the preferred order):
preferred_order = ['A', 'B', 'C']
df1['final'] = np.nan
for col in preferred_order:
df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])
which gives the desired outcome.
Does anyone see a solution that avoids the loop?
You can use where and isin on the full dataframe df1 to mask the values that are not in df2, then reorder the columns with preferred_order, bfill along the columns, and keep the first column with iloc:
preferred_order = ['A', 'B', 'C']
df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
                   [preferred_order]
                   .bfill(axis=1)
                   .iloc[:, 0]
               )
print (df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
Use:
order = ['A', 'B', 'C'] # order of columns
d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)
Details:
Use DataFrame.isin and filter the rows using boolean masking with DataFrame.any along axis=1, then use DataFrame.idxmax along axis=1 to get the name of the first column that holds True in each row.
print(d)
0 A
1 A
2 A
3 A
4 B
6 C
7 C
dtype: object
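Use DataFrame.lookup to look up the values in df1 corresponding to the index and columns of d, and assign these values to column final (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this step needs an older pandas):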
print(df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
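On pandas versions without DataFrame.lookup, the same idea can be sketched with plain NumPy indexing. This is a substitute for the lookup step, not the answer's own code. The names mask, first and picked are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": list("aaabdcde"), "B": list("smnipiuy"), "C": list("zzzqqwll")})
df2 = pd.DataFrame({"mapcol": list("abpppozl")})

order = ["A", "B", "C"]  # preferred order of columns
mask = df1[order].isin(df2["mapcol"].tolist()).to_numpy()

# Position of the first True in each row (0 when the row has no match at all)
first = mask.argmax(axis=1)
picked = df1[order].to_numpy()[np.arange(len(df1)), first]

# Keep the picked value only where the row had at least one match
df1["final"] = pd.Series(picked, index=df1.index).where(mask.any(axis=1))
print(df1)
```

The final where call is what turns rows without any match into NaN; without it, argmax would silently pick the value from the first column.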
I have a dataframe like this one (basically two columns: first contains blogger id and second one contains followers):
blogger follower
A c
A d
A e
A f
A g
A h
A i
A j
A k
B c
B f
B g
B l
B m
B n
B o
B p
B q
B r
B s
B t
B k
C a
C k
C r
C g
C t
C c
C p
C y
C z
C w
What I want to get is a square matrix with all-to-all intersection count, like this:
A B C
A - 4 3
B 4 - 6
C 3 6 -
I'm not a skilled pandas user, and all I have managed is to do this with 2 loops and np.intersect1d, which I believe is not efficient. I've been trying to play with pivot_table(), crosstab() and groupby() with no luck, so unfortunately there is no code to share. Maybe someone here knows an efficient solution?
Perform a self-merge, followed by a cross-tabulation.
i = df.merge(df, on='follower')
j = pd.crosstab(i.blogger_x, i.blogger_y)
j
blogger_y A B C
blogger_x
A 9 4 3
B 4 13 6
C 3 6 10
Of course, the diagonal entries aren't -, but that's easy to fix.
j = j.astype(object)
# indexing with a list of arrays no longer selects the diagonal in recent NumPy
np.fill_diagonal(j.values, '-')
j
blogger_y A B C
blogger_x
A - 4 3
B 4 - 6
C 3 6 -
Note that this ruins performance, because your columns are now object type, which is the only way to mix values of different types in the same column.
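One way to limit that cost, as a minimal sketch: keep the numeric crosstab for any further computation and build a separate object-typed copy purely for display. The small sample and the names pairs, counts, values and display are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small sample of (blogger, follower) pairs
df = pd.DataFrame({
    "blogger":  ["A", "A", "A", "B", "B", "C", "C"],
    "follower": ["c", "d", "k", "c", "k", "k", "z"],
})

# Self-merge on follower, then count every blogger pair
pairs = df.merge(df, on="follower")
counts = pd.crosstab(pairs["blogger_x"], pairs["blogger_y"])  # stays numeric

# Object-typed copy for display only, with the diagonal masked
values = counts.to_numpy().astype(object)
np.fill_diagonal(values, "-")
display = pd.DataFrame(values, index=counts.index, columns=counts.columns)
print(display)
```

Here counts keeps its integer dtype, so sums or comparisons on it remain fast, while display is only meant for printing.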