I would like to know how to make a new row based on the column names row in a python dataframe, and append it to the same dataframe.
example
df = pd.DataFrame(np.random.randn(10, 5),columns=['abx', 'bbx', 'cbx', 'acx', 'bcx'])
I want to create a new row based on the column names that gives:
b | b | b | c | c |by taking the middle char of the column name.
the idea is to use that new row, later, for multi-indexing the columns.
I'm assuming this is what you want as you've not responded, we can append a new row by creating a dict from zipping the df columns and a list comprehension of the middle character (assuming that column name lengths are 3):
In [126]:
df.append(dict(zip(df.columns, [col[1] for col in df])), ignore_index=True)
Out[126]:
abx bbx cbx acx bcx
0 -0.373421 -0.1005462 -0.8280985 -0.1593167 1.335307
1 1.324328 -0.6189612 -0.743703 0.9419248 1.282682
2 0.3730312 -0.06697892 1.113707 -0.9691056 1.779643
3 -0.6644958 1.379606 -0.3751724 -1.135034 0.3287292
4 0.4406139 -0.5767996 -0.2267589 -1.384412 -0.03038372
5 -1.242734 -0.838923 -0.6724592 1.405247 -0.3716862
6 -1.682637 -1.69309 -1.291833 1.781704 0.6321988
7 -0.5793783 -0.6809975 1.03502 -0.6498381 -1.124236
8 1.589016 1.272961 -1.968225 0.5515182 0.3058628
9 -2.275342 2.892237 2.076253 -0.1422845 -0.09776171
10 b b b c c
ix --- lets you read the entire row-- you just say which ever row you want.
then you get your columns and assign them to the raw you want.
See the example below.
virData = DataFrame(df)
virData.columns = virData.ix[1].values
virData.columns
Related
DF1 =[
Column A
Column B
Cell 1
Cell 2
Cell 3
Cell 4
Column A
Column B
Cell 1
Cell 2
Cell 3
Cell 4
]
DF2 = [ NY, FL ]
in this case DF1 and DF2 have two indexes.
The result I am looking for is the following
Main_DF =
[
Column A
Column B
Column C
Cell 1
Cell 2
NY
Cell 3
Cell 4
NY
Column A
Column B
Column C
Cell 1
Cell 2
FL
Cell 3
Cell 4
FL
]
I tried to use pd.concat, assign and insert
none give me the way I'm looking for the result to be
Lists hold references to dataframes. So, you can amend the dataframes and not need to amend the list at all.
So, I'd do something like...
for df, val in zip(DF1, DF2):
df['Column C'] = val
Using zip allows you to iterate though the two lists in sync with each other;
1st element of DF1 goes in to df, and 1st element of DF2 goes into val
2nd element of DF1 goes in to df, and 2nd element of DF2 goes into val
and so on
I want to extract whatever come after -
My data in column A looks like
Column A
Column B
001-3
5
002-14
6
what I want is
Column A
Column B
Column C
001-3
3
5
002-14
14
6
Is there any function like scan for SAS in python, so I can extract the what come after "-" in column A and place it at column B, after move B to C
# reassign Column B to Column C
df['Column C'] = df['Column B']
#split from right and limit to only single split, then take the right value to create column B
df['Column B']=df['Column A'].str.rsplit('-',n=1,expand=True)[1]
df
Column A Column B Column C
0 001-3 3 5
1 002-14 14 6
I'm trying to copy data from different columns to a particular column in the same DataFrame.
Index
col1A
col2A
colB
list
CT
CW
CH
0
1
:
1
b
2
2
3
3d
But prior to that I wanted to search if those columns(col1A,col2A,colB) exist in the DataFrame and group those columns which are present and move the grouped data to relevant columns(CT,CH,etc) like,
CH
CW
CT
0
1
1
1
b
b
2
2
2
3
3d
3d
I did,
col_list1 = ['ColA','ColB','ColC']
test1 = any([ i in df.columns for i in col_list1 ])
if test1==True:
df['CH'] = df['Col1A'] +df['Col2A']
df['CT'] = df['ColB']
this code is throwing me a keyerror
.
I want it to ignore columns that are not present and add only those that are present
IIUC, you can use Python set or Series.isin to find the common columns
cols = list(set(col_list1) & set(df.columns))
# or
cols = df.columns[df.columns.isin(col_list1)]
df['CH'] = df[cols].sum(axis=1)
Instead of just concatenating the columns with +, collect them into a list and use sum with axis=1:
df['CH'] = np.sum([df[c] for c in cl if c in df], axis=1)
I have a big matrix, like this:
df:
A A A B B ... (column names)
A 2 4 5 9 2
A 6 8 7 6 4
A 5 2 6 4 5
B 3 4 1 3 4
B 4 5 3 1 4
.
.
(row names)
I would like to merge the columns with same name, and findig the minimum value. At the end I would like to have a matrix like this:
df_min:
A B ... (column names)
A 2 2
A 6 4
A 2 4
B 1 3
B 3 1
.
.
(row names)
My intentions, afterwards (outside of the question), is to merge the rows as well. Desired outcome:
df_min:
A B ... (column names)
A 2 2
B 1 1
.
.
(row names)
I tried this:
df_min= df.groupby('df.columns, axis=1').agg(np.min)
But it didn't work, it removed some rows (for example, removing entirely row A)... EDIT: Apparently, it worked fine but I had two columns with different names but whitespace at the end of the name. These methods reorder the columns, which confused me.
A snipped of the dataframe:
Simply groupby on the level=0 for each axis:
df.groupby(level=0, axis=1).min()
output:
A B
A 2 2
A 6 4
A 2 4
B 1 3
B 3 1
both axes:
df.groupby(level=0, axis=1).min().groupby(level=0).min()
output:
A B
A 2 2
B 1 1
Alternatively, use a single groupby trough a stack/unstack:
df.stack().groupby(level=[0,1]).min().unstack()
output:
A B
A 2 2
B 1 1
EDIT
numpy only based solution
I'm assuming that you have a list associating names to column indices, e.g. for the first code sample you provided something like
column_names = ['A', 'A', 'A', 'B', 'B']
and that your data type is single-precision floating point. In this scenario, you can do something like the following:
unique_column_names = list(dict.fromkeys(column_names)) # get unique column names preserving original order
df_min = np.empty((df.shape[0], len(unique_column_names), dtype=np.float32) # allocate output array
for i, column_name in enumerate(unique_column_names): # iterate over unique column names
column_indices = [id for id in range(df.shape[1]) if column_names[id] == column_name] # extract all column indices having the same name
tmp = df[:, column_indices] # extract columns named as column_name
df_min[:, i] = np.amin(tmp, axis=1)] # take min by row and save result
Then, if you want to repeat the process by row, assuming you have another list associating row indices and names named row_names
unique_row_names = list(dict.fromkeys(row_names)) # get unique row names preserving order
df_final = np.empty((len(unique_row_names), len(unique_column_names), dtype=np.float32) # allocate final output
for j, row_name in enumerate(unique_row_names): # iterate over unique row names
row_indices = [id for id in range(df.shape[0]) if row_names[id] == row_name] # extract rows having row_name
tmp = df_min[row_indices, :] # extract rows named as row_name from the column-reduced matrix
df_final[j, :] = np.amin(tmp, axis=0) # take min by column and save result
The column-name and row-name association list for the final output are unique_column_names and unique_row_names
I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity reasons I would now like to add additional index rows to the DataFrame as shown here:
However in the link these multiple index rows are generated right when the DataFrame is created. I would like to add e.g. rows for unit or descr to the columns.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.