How to add one more column in dataframe - python

I am trying to add one more column to my existing dataframe.
Example:
In[16] df1
Out[16]:
           0
0  MDS31505B
But I want to add one more column at the beginning, at column index 0:
In[16] df1
Out[16]:
       0          1
0  SKUID  MDS31505B
I am trying this, but it has no effect:
df1.insert(loc=0, column='0', value=SKUID)
name 'SKUID' is not defined

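The value must be passed as a quoted string; a bare SKUID is read as an undefined variable. You can also try df1.insert(0, "new", "SKUID"). The column label has to differ from the existing ones, because insert raises a ValueError when the label already exists (unless allow_duplicates=True).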
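A minimal sketch of the whole operation, assuming df1 has a single column labelled with the integer 0 as in the question. The temporary label "label" is a hypothetical name chosen only to avoid a clash with the existing column; the columns are renumbered afterwards to match the desired output.

```python
import pandas as pd

# Assumed reconstruction of the dataframe from the question
df1 = pd.DataFrame({0: ["MDS31505B"]})

# The value is a quoted string; an unquoted SKUID would raise a NameError
df1.insert(loc=0, column="label", value="SKUID")  # "label" is a hypothetical temporary name

# Renumber the columns so they read 0, 1 as in the desired output
df1.columns = range(df1.shape[1])

print(df1)
```

insert modifies df1 in place and returns None, so its result should not be assigned back to df1.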

Related

Pandas dataframe groupby and aggregate with conditions

Is there a way to group my dataframe by specific columns and keep empty values, but only when all of the values in that column are empty for the group?
Example:
I have a dataframe that looks like this:
I am trying to group the dataframe by Name and Subject.
My expected output looks like this:
So, if a person takes more than one subject but one of them is empty, drop that row so it is not included when the other rows are aggregated. If a person takes only one subject and it is empty, do not drop the row.
[Updated]
Original dataframe
The outcome will still be the same. It will take the first row's value if all subjects of a person are empty.
[Updated] Another new dataframe
The outcome will have the same number of subjects, but there will be 3 years.
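Here is a proposal using GroupBy.agg:
# keep one row per (ID, Name, Subject); repeated NaN subjects collapse to one row
df = df.drop_duplicates(subset=["ID", "Name", "Subject"])
# mark rows with an empty Subject in groups that have more than one row
m = (df.groupby(["ID", "Name"])["Subject"].transform("size").gt(1)
     & df["Subject"].isnull())
# drop the marked rows, then collect the remaining values into lists
out = df.loc[~m].groupby(["ID", "Name"], as_index=False).agg(list)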
Output:
print(out)
   ID Name          Subject    Year
0   1   CC  [Math, English]  [1, 3]
1   2   DD        [Physics]     [2]
2   3   EE      [Chemistry]     [1]
3   4   FF            [nan]     [0]
4   5   GG            [nan]     [0]
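The question's dataframes were posted as images, so the input below is made up: the rows and values are assumptions chosen to be consistent with the printed output, and only the three pandas statements come from the answer. It is a sketch for trying the approach, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical input, invented to match the output shown in the answer
df = pd.DataFrame({
    "ID":      [1, 1, 1, 2, 3, 4, 4, 5],
    "Name":    ["CC", "CC", "CC", "DD", "EE", "FF", "FF", "GG"],
    "Subject": ["Math", np.nan, "English", "Physics", "Chemistry", np.nan, np.nan, np.nan],
    "Year":    [1, 2, 3, 2, 1, 0, 1, 0],
})

# Collapse repeated (ID, Name, Subject) rows; duplicate NaN subjects count as equal
df = df.drop_duplicates(subset=["ID", "Name", "Subject"])

# True for empty subjects inside groups that still have more than one row
m = (df.groupby(["ID", "Name"])["Subject"].transform("size").gt(1)
     & df["Subject"].isnull())

# Remove those rows and aggregate the rest into lists
out = df.loc[~m].groupby(["ID", "Name"], as_index=False).agg(list)
print(out)
```

With this input, the empty subject of CC is dropped because CC has other subjects, while FF and GG keep a single empty entry because nothing else remains in their groups.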

df.index vs df["index"] after resetting index [duplicate]

import pandas as pd
df1 = pd.DataFrame({
"value": [1, 1, 1, 2, 2, 2]})
print(df1)
print("-------------------------")
print(df1.reset_index())
print("-------------------------")
print(df1.reset_index().index)
print("-------------------------")
print(df1.reset_index()["index"])
produces the output
value
0 1
1 1
2 1
3 2
4 2
5 2
-------------------------
index value
0 0 1
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
-------------------------
RangeIndex(start=0, stop=6, step=1)
-------------------------
0 0
1 1
2 2
3 3
4 4
5 5
Name: index, dtype: int64
I am wondering why print(df1.reset_index().index) and print(df1.reset_index()["index"]) print different things in this case. The latter prints the "index" column, while the former prints the row index.
If we want to access the reset indices (the column), it seems we have to use brackets?
The .index attribute of a pandas DataFrame always points to the Index (the row labels) of the DataFrame, not to a column named "index".
If we want to access the reset indices (the column), then it seems we
have to use brackets?
Yes, or you can assign a name when resetting the index (the names parameter requires pandas 1.5 or later), for example:
df1.reset_index(names='the_index').the_index
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# Name: the_index, dtype: int64
Several things happened. First, when you don't specify an index, pandas uses a RangeIndex object as a virtual index for the dataframe. The dataframe is a collection of numpy arrays, which are naturally indexed from 0, 1, 2, and so on. Since a RangeIndex is just 0, 1, and so on, pandas doesn't actually create its values in memory. Had you printed the index of the original df1, it would be a RangeIndex, just like df1.reset_index().index.
reset_index has an optional drop parameter. By default (drop=False), pandas takes the existing index and turns it into a column of the dataframe. This was a RangeIndex object, but it had to be expanded into a realized column to fit with the other columns in the df. Had you included drop=True, there would be no "index" column.
A dataframe always has to have some index, so after a reset the default is the virtual RangeIndex you see.
DataFrames have a shortcut where some columns can be addressed by attribute name rather than by item (the square brackets). But if the column name doesn't meet Python's attribute naming rules, or if it clashes with an existing attribute, you can't reference it that way. .index is the dataframe index, so if you happen to also have a column "index", you need to access it via the square bracket item protocol.
One could argue that pandas should never have allowed the attribute access path because it can't be used consistently. I wouldn't argue that (except I totally would).
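The points above can be checked with a short sketch that reuses df1 from the question. The variable names kept and dropped are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"value": [1, 1, 1, 2, 2, 2]})

kept = df1.reset_index()              # old index becomes a column named "index"
dropped = df1.reset_index(drop=True)  # old index is discarded instead

# Attribute access always returns the row labels, never the "index" column
print(type(kept.index).__name__)

# Bracket access returns the column that reset_index created
print(kept["index"].tolist())

# With drop=True no "index" column exists at all
print("index" in dropped.columns)
```

Bracket access works for every column label, which is why it is the safer habit when a name such as "index" collides with a DataFrame attribute.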
It does this because you are printing different things:
print(df1.reset_index().index)
is the same as:
df = df1.reset_index()
print(df.index)
This first moves the old index into a new "index" column, then prints the actual (new) index of the df.
print(df1.reset_index()["index"])
is the equivalent of:
df = df1.reset_index()
print(df["index"])
This also moves the old index into an "index" column, so the df has both "index" and "value" columns. It then prints the column named "index" (which is NOT the index of the df).
If you want to make the "index" column the index again, you must use:
df = df.set_index("index")

Assign counts from .count() to a dataframe + column names - pandas python

Hoping someone can help me here. I believe I am close to the solution.
I have a dataframe on which I am using .count() to return a series of all the column names of my dataframe and their respective non-NaN value counts.
Example dataframe:
feature_1  feature_2
1          1
2          NaN
3          2
4          NaN
5          3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe with the column names "Feature" and "Count", so that the expected output looks like this:
Feature    Count
feature_1  5
feature_2  3
I am using .to_frame() to convert the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However, I receive this error message: "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements", as though it does not recognise the actual column names (Feature) as a column with values.
How can I get it to recognise both the Feature and Count columns so that I can add column names to them?
Use Series.reset_index instead of Series.to_frame to get a two-column DataFrame: the first column comes from the index of the Series, the second from its values:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution uses the name parameter together with Series.rename_axis, or DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column: the column names form the index of the series, and to_frame() carries them over as the index of the dataframe rather than as a column. In order to assign a two-element list to df.columns, you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']
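To see both the cause of the error and the fix side by side, here is a small sketch that rebuilds the example dataframe from the question. The variable names counts and one_col are illustrative.

```python
import numpy as np
import pandas as pd

# Example dataframe from the question
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [1, np.nan, 2, np.nan, 3],
})

counts = data.count()        # Series: column names in the index, counts as values

# to_frame() keeps the names in the index, so there is only one real column
one_col = counts.to_frame()
print(one_col.shape)

# reset_index() moves the names into a column, giving two columns to rename
df = counts.rename_axis("Feature").reset_index(name="Count")
print(df)
```

The shape of one_col explains the "Expected axis has 1 elements" message: only the count values form a column until the index is reset.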

append or join value from one dataframe to every row in another dataframe in Pandas

I'm normally OK on the joining and appending front, but this one has got me stumped.
I've got one dataframe with only one row in it. I have another with multiple rows. I want to append the value from one of the columns of my first dataframe to every row of my second.
df1:
id  Value
1   word
df2:
id  data
1   a
2   b
3   c
Output I'm seeking:
df2
id  data  Value
1   a     word
2   b     word
3   c     word
I figured that this was along the right lines, but it listed out NaN for all rows:
df2 = df2.append(df1[df1['Value'] == 1])
I guess I could just join on the id value and then copy the value to all rows, but I assumed there was a cleaner way to do this.
Thanks in advance for any help you can provide!
Just get the first element of the Value column of df1 and assign it to a Value column in df2. A scalar is broadcast to every row:
df2['Value'] = df1.loc[0, 'Value']
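A runnable sketch of that assignment, rebuilding df1 and df2 from the tables in the question. It uses .iloc[0] rather than .loc[0, ...] as a small variation, on the assumption that the single row of df1 might not carry the index label 0.

```python
import pandas as pd

# Dataframes reconstructed from the question
df1 = pd.DataFrame({"id": [1], "Value": ["word"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "data": ["a", "b", "c"]})

# Take the single value by position and let pandas broadcast it to all rows
df2["Value"] = df1["Value"].iloc[0]

print(df2)
```

Assigning a scalar to a new column avoids any join, so the id values of the two dataframes do not need to match.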

How to get the index value of column value compared with another column value

I want something like this.
Index  Sentence
0      I
1      want
2      like
3      this
Keyword  Index
want     1
this     3
I tried df.index("Keyword"), but it does not return a result for all the rows. It would be really helpful if someone could solve this.
Use isin with boolean indexing:
df = df[df['Sentence'].isin(['want', 'this'])]
print(df)
   Index Sentence
1      1     want
3      3     this
EDIT: If you need to compare against another column:
df = df[df['Sentence'].isin(df['Keyword'])]
#another DataFrame df2
#df = df[df['Sentence'].isin(df2['Keyword'])]
And if you need the index values:
idx = df.index[df['Sentence'].isin(df['Keyword'])]
#alternative
#idx = df[df['Sentence'].isin(df['Keyword'])].index
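The following sketch puts the pieces together for the two-dataframe case. It assumes the keywords live in a separate dataframe df2, as in the commented-out line of the answer; the result frame and its layout are an illustration of the desired Keyword/Index table, not part of the original answer.

```python
import pandas as pd

# Data reconstructed from the question; df2 is an assumed second dataframe
df = pd.DataFrame({"Sentence": ["I", "want", "like", "this"]})
df2 = pd.DataFrame({"Keyword": ["want", "this"]})

# True where the word in Sentence appears among the keywords
mask = df["Sentence"].isin(df2["Keyword"])

# Pair each matching word with its row label in df
result = pd.DataFrame({
    "Keyword": df.loc[mask, "Sentence"].tolist(),
    "Index": df.index[mask].tolist(),
})

print(result)
```

The rows of result follow the order of df, not the order of the keywords in df2.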
