I have two columns.
The first column has the values A, B, C, D, and the second column has values corresponding to A, B, C, and D.
I'd like to convert/transpose A, B, C, and D into 4 columns named A, B, C, D, and have whatever values previously corresponded to A, B, C, D (the original 2nd column) ordered beneath the respective column (A, B, C, or D). The original order must be preserved.
Here's an example.
Input:
A|1
B|2
C|3
D|4
A|3
B|6
C|3
D|6
Desired output:
A|B|C|D
1|2|3|4
3|6|3|6
Any ideas on how I can accomplish this using Pandas/Python?
Thanks a lot!
Very similar to pivoting with two columns (Q/A 10 here):
(df.assign(idx=df.groupby('col1').cumcount())
.pivot(index='idx', columns='col1', values='col2')
)
Output:
col1 A B C D
idx
0 1 2 3 4
1 3 6 3 6
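For a self-contained run, the example input can be read straight from the pipe-delimited text (a sketch; the column names col1/col2 simply match the snippet above):

import io
import pandas as pd

raw = """A|1
B|2
C|3
D|4
A|3
B|6
C|3
D|6"""
df = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=['col1', 'col2'])

out = (df.assign(idx=df.groupby('col1').cumcount())
         .pivot(index='idx', columns='col1', values='col2'))
print(out)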
To ensure your original order is preserved, you need to "capture" the order first; I am going to use the unique method for this situation:
Given df,
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [*'ZCYBWA']*2, 'Col2': np.arange(12)})
Col1 Col2
0 Z 0
1 C 1
2 Y 2
3 B 3
4 W 4
5 A 5
6 Z 6
7 C 7
8 Y 8
9 B 9
10 W 10
11 A 11
Let's get the order using unique:
order = df['Col1'].unique()
Then we can reshape using:
df.set_index([df.groupby('Col1').cumcount(), 'Col1'])['Col2'].unstack()
Col1 A B C W Y Z
0 5 3 1 4 2 0
1 11 9 7 10 8 6
But by adding reindex we can restore the original order:
df.set_index([df.groupby('Col1').cumcount(), 'Col1'])['Col2'].unstack().reindex(order, axis=1)
Col1 Z C Y B W A
0 0 1 2 3 4 5
1 6 7 8 9 10 11
Related
I have a DataFrame which has an array in one of the columns; some of the elements of this array correspond to elements in other rows, others don't.
Is there an easy way that I can split the array and add it to the columns that already have some entries, and create new columns for the rest?
Here's an example:
Count Tag Tag_Array
2 A [A]
3 B [B]
8 C [C]
4 - [A, C, D]
3 E [E]
And what I'd like to do is get the following dataframe:
Count Tag
6 A
3 B
12 C
4 D
3 E
Thanks a lot in advance!
Use DataFrame.explode, then GroupBy.sum:
df.explode(column='Tag_Array').groupby('Tag_Array', as_index=False)['Count'].sum()
Tag_Array Count
0 A 6
1 B 3
2 C 12
3 D 4
4 E 3
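A self-contained sketch of the above, reconstructing the example frame from the question (explode requires pandas >= 0.25; only Count and Tag_Array are actually needed):

import pandas as pd

df = pd.DataFrame({
    'Count': [2, 3, 8, 4, 3],
    'Tag': ['A', 'B', 'C', '-', 'E'],
    'Tag_Array': [['A'], ['B'], ['C'], ['A', 'C', 'D'], ['E']],
})

out = (df.explode(column='Tag_Array')
         .groupby('Tag_Array', as_index=False)['Count']
         .sum())
print(out)   # A 6, B 3, C 12, D 4, E 3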
explode, then do the groupby with sum:
df.explode('Tag_Array').groupby('Tag_Array').Count.sum().reset_index()
Out[237]:
Tag_Array Count
0 A 6
1 B 3
2 C 12
3 D 4
4 E 3
I'm new to Python and dataframes, so I am wondering if someone knows how I could accomplish the following. I have a dataframe with many columns, some of which share a beginning and have an underscore followed by a number (bird_1, bird_2, bird_3). I want to essentially merge all of the columns that share a beginning into single columns holding all the values that were contained in the constituent columns. Then I'd like to run df[column].value_counts() for each.
Initial dataframe
Final dataframe
For df['bird'].value_counts(), I would get a count of 1 for each of A through L.
For df['cat'].value_counts(), I would get a count of 3 for A, 4 for B, and 1 for C.
The ultimate goal is to get a count of unique values for each column type (bird, cat, dog, etc.)
You can do:
df.columns=[col.split("_")[0] for col in df.columns]    # drop the numeric suffix so related columns share a name
df=df.unstack().reset_index(1, drop=True).reset_index() # stack everything into one long (name, value) series
df["id"]=df.groupby("index").cumcount()                 # number the values within each name
df=df.pivot(index="id", values=0, columns="index")      # one column per name again
Outputs:
index bird cat
id
0 A A
1 B A
2 C A
3 D B
4 E B
5 F B
6 G B
7 H C
8 I NaN
9 J NaN
10 K NaN
11 L NaN
From there to get counts of all possible values:
df.T.stack().reset_index(1, drop=True).reset_index().groupby(["index", 0]).size()
Outputs:
index 0
bird A 1
B 1
C 1
D 1
E 1
F 1
G 1
H 1
I 1
J 1
K 1
L 1
cat A 3
B 4
C 1
dtype: int64
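Since the original frames were only shown as images, here is a hypothetical input (the column names bird_1 ... cat_2 are my assumption) that reproduces the reshaped output above when run through the same steps:

import pandas as pd

df = pd.DataFrame({
    'bird_1': list('ABCD'), 'bird_2': list('EFGH'), 'bird_3': list('IJKL'),
    'cat_1':  list('AAAB'), 'cat_2':  list('BBBC'),
})

df.columns = [col.split("_")[0] for col in df.columns]
df = df.unstack().reset_index(1, drop=True).reset_index()
df["id"] = df.groupby("index").cumcount()
df = df.pivot(index="id", values=0, columns="index")
print(df)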
SUMMARY: The output of my code gives me a dataframe in the following format. The column headers of the dataframe are the labels for the text in the column Content. The labels will be used as training data for a multilabel classifier in the next step. This is a snippet of actual data, which is much larger.
Since they are column titles, they are not directly mapped to the text they are the labels for.
Content  A  B  C  D  E
zxy      1  2     1
wvu      1     2  1
tsr      1  2        2
qpo         1  1  1
nml            2  2
kji      1     1     2
hgf            1     2
edc      1  2     1
UPDATE: Converting the df to csv shows the empty cells are blank ('' vs ' ').
Here, Content is the column where the text is, and A, B, C, D, and E are the column headers that need to be turned into the labels. Only columns with 1s or 2s are relevant; columns with only empty cells are not relevant and thus don't need to be converted into labels.
UPDATE: After some digging, it turns out the numbers might not be ints, but strings.
I know that when entering the text + labels into a classifier for processing, the length of both arrays needs to be equal, else it is not accepted as valid input.
Is there a way I can convert the columns titles to labels for the text in Content in the DF?
EXPECTED OUTPUT:
  Content  A  B  C  D  E   Labels
0 zxy      1  2     1      A, B, D
1 wvu      1     2  1      A, C, D
2 tsr      1  2        2   A, B, E
3 qpo         1  1  1      B, C, D
4 nml            2  2      C, D
5 kji      1     1     2   A, C, E
6 hgf            1     2   C, E
7 edc      1  2     1      A, B, D
Full Solution:
# first: strip all whitespace before and after each value; fine for all columns
for col in df.columns:
    df[col] = df[col].str.strip()
# fill NaN with 0
df.fillna(0, inplace=True)
# replace '' with 0
df.replace('', 0, inplace=True)
# convert to int; this must only be done on the specific columns with the numeric data.
# This list is the column names as you've presented them; if they are different
# in the real data, replace them.
for col in ['A', 'B', 'C', 'D', 'E']:
    df = df.astype({col: 'int16'})
df.info()
# you should end up with something like this.
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
Content 8 non-null object
A 8 non-null int16
B 8 non-null int16
C 8 non-null int16
D 8 non-null int16
E 8 non-null int16
dtypes: int16(5), object(1)
memory usage: 272.0+ bytes
"""
We can use dot. Notice that here I treat the blanks as np.nan; if your data contains real blanks ('') instead, switch to the commented-out last line.
# make certain the label names match the appropriate columns
s=df.loc[:, ['A', 'B', 'C', 'D', 'E']]
# or
s=df.loc[:,'A':]
df['Labels']=(s>0).dot(s.columns+',').str[:-1] # column A:E need to be numeric, not str
# df['Labels']=(~s.isin([''])).dot(s.columns+',').str[:-1]  # use this instead if the blanks are '' strings
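A quick end-to-end check of this dot trick on a small hypothetical frame (just the first two rows of the example, already cleaned to ints):

import pandas as pd

df = pd.DataFrame({'Content': ['zxy', 'wvu'],
                   'A': [1, 1], 'B': [2, 0], 'C': [0, 2],
                   'D': [1, 1], 'E': [0, 0]})

s = df.loc[:, 'A':]
df['Labels'] = (s > 0).dot(s.columns + ',').str[:-1]
print(df[['Content', 'Labels']])   # zxy -> A,B,D   wvu -> A,C,D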
Here's another way using np.where and groupby:
r, c = np.where(df>0)
df['Labels'] = pd.Series(df.columns[c], index=df.index[r]).groupby(level=[0, 1]).agg(', '.join)
Output:
         A  B  C  D  E   Labels
0 zxy    1  2  0  1  0  A, B, D
1 wvu    1  0  2  1  0  A, C, D
2 tsr    1  2  0  0  2  A, B, E
3 qpo    0  1  1  1  0  B, C, D
4 nml    0  0  2  2  0     C, D
5 kji    1  0  1  0  2  A, C, E
6 hgf    0  0  1  0  2     C, E
7 edc    1  2  0  1  0  A, B, D
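A runnable sketch of this np.where approach on a small hypothetical frame; Content is pushed into the index so there are two index levels, which is what the groupby(level=[0, 1]) assumes:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Content': ['zxy', 'wvu'],
                   'A': [1, 1], 'B': [2, 0], 'C': [0, 2],
                   'D': [1, 1], 'E': [0, 0]}).set_index('Content', append=True)

r, c = np.where(df > 0)
df['Labels'] = (pd.Series(df.columns[c], index=df.index[r])
                  .groupby(level=[0, 1])
                  .agg(', '.join))
print(df)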
You can also do that as follows:
# melt the two dimensional representation to
# a more or less onedimensional representation
df_flat= df.melt(id_vars=['Content'])
# filter out all rows which belong to empty cells
# the following is a fail-safe method that should
# work for all datatypes you might encounter in your
# columns
df_flat= df_flat[~df_flat['value'].isna() & (df_flat['value'] != 0)]
df_flat= df_flat[~df_flat['value'].astype('str').str.strip().isin(['', 'nan'])]
# join the variables used per original row
df_flat.groupby(['Content']).agg({'variable': lambda ser: ', '.join(ser)})
The output looks like this:
variable
idx Content
0 zxy A, B, D
1 wvu A, C, D
2 tsr A, B, E
3 qpo B, C, D
4 nml C, D
5 kji A, C, E
6 hgf C, E
7 edc A, B, D
Given the following input data:
import pandas as pd
import io
raw="""idx Content A B C D E
0 zxy 1 2 1
1 wvu 1 2 1
2 tsr 1 2 2
3 qpo 1 1 1
4 nml 2 2
5 kji 1 1 2
6 hgf 1 2
7 edc 1 2 1 """
df= pd.read_fwf(io.StringIO(raw))
df.drop(['idx'], axis='columns', inplace=True)
Edit: I just removed 'idx' right after reading to create a structure like in the original dataframe, and added some fail-safe code that works with different datatypes (the two lines below the melt call). If more is known about how the missing values are actually represented, the code can be simplified.
I can't find this question answered anywhere, so I'll just try here instead:
What I'm trying to do is basically alter an existing DataFrame object using groupby-functionality, and a self-written function:
benchmark =
x y z field_1
1 1 3 a
1 2 5 b
9 2 4 a
1 2 5 c
4 6 1 c
What I want to do is group by field_1, apply a function using specific columns as input (in this case columns x and y), then add the result back to the original DataFrame benchmark as a new column called new_field. The function itself depends on the value in field_1, i.e. field_1=a will yield a different result compared to field_1=b etc. (hence the grouping to start with).
Pseudo-code would be something like:
1. grouped_data = benchmark.groupby(['field_1'])
2. apply own_function to grouped_data; with inputs ('x', 'y', grouped_data)
3. add back result from function to benchmark as column 'new_field'
Thanks,
Elaboration:
I also have a DataFrame separate_data containing separate values for x,
separate_data =
x a b c
1 1 3 7
2 2 5 6
3 2 4 4
4 2 5 9
5 6 1 10
that will need to be interpolated onto the existing benchmark DataFrame. Which column in separate_data should be used for interpolation depends on column field_1 in benchmark (i.e. the values in the set (a, b, c) above). The interpolated value in the new column is based on the x-value in benchmark.
Result:
benchmark =
x y z field_1 field_new
1 1 3 a interpolate using separate_data with x=1 and col=a
1 2 5 b interpolate using separate_data with x=1 and col=b
9 2 4 a ... etc
1 2 5 c ...
4 6 1 c ...
Makes sense?
EDIT:
I think you need to reshape separate_data first by set_index + stack, set the index names by rename_axis, and set the name of the Series by rename.
Then it is possible to groupby both levels and apply some function.
Then join it to benchmark with the default left join:
separate_data1 =separate_data.set_index('x').stack().rename_axis(('x','field_1')).rename('d')
print (separate_data1)
x field_1
1 a 1
b 3
c 7
2 a 2
b 5
c 6
3 a 2
b 4
c 4
4 a 2
b 5
c 9
5 a 6
b 1
c 10
Name: d, dtype: int64
If necessary, apply some function; this is mainly useful if there are duplicates in the (x, field_1) pairs, since it then returns nice unique pairs:
def func(x):
    # sample function
    return x / 2 + x ** 2
separate_data1 = separate_data1.groupby(level=['x','field_1']).apply(func)
print (separate_data1)
x field_1
1 a 1.5
b 10.5
c 52.5
2 a 5.0
b 27.5
c 39.0
3 a 5.0
b 18.0
c 18.0
4 a 5.0
b 27.5
c 85.5
5 a 39.0
b 1.5
c 105.0
Name: d, dtype: float64
benchmark = benchmark.join(separate_data1, on=['x','field_1'])
print (benchmark)
x y z field_1 d
0 1 1 3 a 1.5
1 1 2 5 b 10.5
2 9 2 4 a NaN
3 1 2 5 c 52.5
4 4 6 1 c 85.5
I think you cannot use transform here, because multiple columns are read together.
So use apply:
df1 = benchmark.groupby(['field_1']).apply(func)
And then for the new column there are multiple solutions, e.g. use join (default left join) or map.
A sample solution with both methods is here.
Or it is possible to use a flexible apply, which can return a new DataFrame with the new column.
Try something like this:
groups = benchmark.groupby(benchmark["field_1"])
benchmark = benchmark.join(groups.apply(your_function), on="field_1")
In your_function you would create the new column using the other columns that you need, e.g. average them, sum them, etc.
Documentation for apply.
Documentation for join.
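A minimal sketch of that pattern; your_function here is a hypothetical stand-in that just averages x and y per group:

def your_function(g):
    # one scalar per group; replace with whatever per-group logic you need
    return g['x'].mean() + g['y'].mean()

groups = benchmark.groupby(benchmark["field_1"])
benchmark = benchmark.join(groups.apply(your_function).rename("new_field"), on="field_1")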
Here is a working example:
# Sample function that sums x and y, then appends field_1 as a string.
def func(x, y, z):
    return (x + y).astype(str) + z
benchmark['new_field'] = benchmark.groupby('field_1')\
.apply(lambda x: func(x['x'], x['y'], x['field_1']))\
.reset_index(level = 0, drop = True)
Result:
benchmark
Out[139]:
x y z field_1 new_field
0 1 1 3 a 2a
1 1 2 5 b 3b
2 9 2 4 a 11a
3 1 2 5 c 3c
4 4 6 1 c 10c
I have a dataframe below
df=pd.DataFrame({"A":np.random.randint(1,10,9),"B":np.random.randint(1,10,9),"C":list('abbcacded')})
A B C
0 9 6 a
1 2 2 b
2 1 9 b
3 8 2 c
4 7 6 a
5 3 5 c
6 1 3 d
7 9 9 e
8 3 4 d
I would like to get the grouping result below (with column "C" as the key); the rows for c, d, and e are dropped intentionally.
number A_sum B_sum
a 2 16 15
b 2 3 11
This is a 2-row by 3-column dataframe, and the grouping key is column C.
The column "number" represents the count of each letter (a and b).
A_sum and B_sum represent the group-wise sums of columns A and B for each letter in column C.
I guess we should use the groupby method, but how can I get this data summary table?
You can do this using a single groupby with
res = df.groupby(df.C).agg({'A': 'sum', 'B': {'sum': 'sum', 'count': 'count'}})
res.columns = ['A_sum', 'B_sum', 'count']
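Note that nested dicts in agg were removed in later pandas versions. A sketch of the same idea with named aggregation (assuming pandas >= 0.25), keeping only the a and b groups as in the desired output:

res = (df.groupby('C')
         .agg(number=('A', 'size'),   # row count per group; any column works with 'size'
              A_sum=('A', 'sum'),
              B_sum=('B', 'sum'))
         .loc[['a', 'b']])            # keep only groups a and b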
One option is to count the size and sum the columns for each group separately and then join them by index:
df.groupby("C")['A'].agg({"number": 'size'}).join(df.groupby('C').sum())
   number   A   B
C
a       2  11   8
b       2  14  12
c       2   8   5
d       2  11  12
e       1   7   2
You can also do df.groupby('C').agg(["sum", "size"]) which gives an extra duplicated size column, but if you are fine with that, it should also work.
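A sketch of that last variant, flattening the resulting MultiIndex columns and dropping the duplicated size column:

out = df.groupby('C')[['A', 'B']].agg(['sum', 'size'])
out.columns = [f'{col}_{stat}' for col, stat in out.columns]   # A_sum, A_size, B_sum, B_size
out = (out.rename(columns={'A_size': 'number'})
          .drop(columns='B_size')[['number', 'A_sum', 'B_sum']])
# out.loc[['a', 'b']] would keep only the a and b groups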