pandas dataframe with identical column names - is it valid process? - python

I was able to produce a pandas dataframe with identical column names.
Is it this normal fro a pandas dataframe?
How can I choose one of the two columns only?
Using the identical name, it has, as a result, to produce as output both columns of the dataframe?
Example given below:
# Producing a new empty pd dataset
dataset=pd.DataFrame()
# fill in a list with values to be added to the dataset later
cases=[1]*10
# Adding the list of values in the dataset, and naming the variable / column
dataset["id"]=cases
# making a list of columns as it is displayed below:
data_columns = ["id", "id"]
# Then, we call the pd dataframe using the defined column names:
dataset_new=dataset[data_columns]
# dataset_new
# It has as a result two columns with identical names.
# How can I process only one of the two dataset columns?
id id
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1

You can use the .iloc to access either column.
dataset_new.iloc[:,0]
or
dataset_new.iloc[:,1]
and of course you can rename your columns just like you did when you set them both to 'id' using:
dataset_new.column = ['id_1', 'id_2']

df = pd.DataFrame()
lst = ['1', '2', '3']
df[0] = lst
df[1] = lst
df.rename(columns={0:'id'}, inplace=True)
df.rename(columns={1:'id'}, inplace=True)
print(df[[1]])

Related

Assign counts from .count() to a dataframe + column names - pandas python

Hoping someone can help me here - i believe i am close to the solution.
I have a dataframe, of which i have am using .count() in order to return a series of all column names of my dataframe, and each of their respective non-NAN value counts.
Example dataframe:
feature_1
feature_2
1
1
2
NaN
3
2
4
NaN
5
3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe, with the column names "Feature" and "Count". To have the expected output look like this:
Feature
Count
feature_1
5
feature_2
3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However receiving this error message - "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements", as if though it is not recognising the actual column names (Feature) as a column with values.
How can i get it to recognise both Feature and Count columns to be able to add column names to them?
Add Series.reset_index instead Series.to_frame for 2 columns DataFrame - first column from index, second from values of Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution with name parameter and Series.rename_axis or with DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column (the column name is taken as series index, then translated into dataframe index with the func to_frame()). In order to assign a 2 elements list to df.columns you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']

Add multi level column to dataframe

At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1,2])), orient="index",
columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
test
value
1 1
3 2
To add another column, i've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3,4])), orient="index",
columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
test test2
nodes value value2
1 1 3
3 2 4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultIndex on the columns by storing your DataFrames in a dict then concat along axis=1. The keys of the dict become levels of the column MultiIndex (if you use tuples it adds multiple levels depending on the length, scalar keys add a single level) and the DataFrame columns stay as is. Alignment is enforced on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
foo foo2
bar bar2 bar1
val val2 val2
nodes
0 1 4 7
1 2 5 8
2 3 6 9

How to add a MultiIndex after loading csv data into a pandas dataframe?

I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity reasons I would now like to add additional index rows to the DataFrame as shown here:
However in the link these multiple index rows are generated right when the DataFrame is created. I would like to add e.g. rows for unit or descr to the columns.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.

How to copy one DataFrame column in to another Dataframe if their indexes values are the same

After creating a DataFrame with some duplicated cell values in column with the name 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames which are the consolidated versions of the original DataFrame df. Those newly created DataFrames will have no duplicated cell values under the 'keys' column:
df_sum = df_a.groupby('keys', axis=0).sum().reset_index()
df_mean = df_b.groupby('keys', axis=0).mean().reset_index()
As you can see df_sum['values'] cells values were all summed together.
While df_mean['values'] cell values were averaged with mean() method.
Lastly I rename the 'values' column in both dataframes with:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How to achieve this?
The Photoshoped image below illustrates the dataframe I would like to create. Both 'sums' and 'means' columns are merged into a single DataFrame:
There are several ways to do this. Using the merge function off the dataframe is the most efficient.
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
keys sums means
0 1 1 1.0
1 2 5 2.5
2 3 22 5.5
I think pandas.merge() is the function you are looking for. Like pd.merge(df_sum, df_mean, on = "keys"). Besides, this result can also be summarized on one agg function as following:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
# keys sum mean
#0 1 1 1.0
#1 2 5 2.5
#2 3 22 5.5

Pandas indexing and accessing columns by names

I am trying to access pandas dataframe by column names after indexing the df with a specific column and it returns incorrect column values.
import pandas as pd
rs =pd.read_csv('rs.txt', header="infer", sep="\t", names=['id', 'exp','fov','cycle', 'color', 'values'], index_col=2)
rs.cycle.head()
I am indexing the df here with 'fov' and I want to access the 'cycle' column, it gives me the color column instead. I think I am missing something here?
EDIT
The first few lines of the input file are:
6 3 1 G 0.96593
6 3 1 O 0.88007
6 3 1 R 0.94305
6 3 2 B 0.90554
6 3 2 G 0.93146
I think the problem arises because your data file has 5 columns and your names list has 6 elements. To verify, check the first few values in the id column- these will all be set to 6 if I am right. The First few items in the exp column will have the value 3.
To fix this, read your input file like so:
rs =pd.read_csv('rs.txt', header="infer", sep="\t", names=['exp','fov','cycle', 'color', 'values'], index_col=2
Pandas will automatically insert row identifiers.

Categories

Resources