Get subset of dataframe (python) based on requested column

I have the following problem with a dataframe in Python:
I have a dataframe with an ID column (which is not the index) and other columns.
Now I want to write code that returns a new dataframe containing all rows that have the same value in columnx as the row with the requested item ID. It should also contain all columns of the dataframe df.
def subset(itemID):
    columnxValue = df[df['ID'] == itemID]['columnx']
    subset = df[df['columnx'] == columnxValue]
    return subset
If I do it like this, I always get the error "Can only compare identically-labeled Series objects".
I changed the question to be more clear.

You can use .loc as follows:
def subset(itemID):
    columnValueRequest = df.loc[df['ID'] == itemID, 'columnx'].iloc[0]
    subset1 = df[df['columnx'] == columnValueRequest]
    return subset1
Because you want a single value rather than a Series for the variable columnValueRequest, you need .iloc[0] to extract the (first) value.
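A minimal self-contained sketch of the approach above. The sample data, the choice to pass df as a parameter, and the decision to return an empty frame for an unknown ID are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "columnx": ["a", "b", "a", "c"],
    "other": [10, 20, 30, 40],
})

def subset(df, item_id):
    """Return all rows sharing the columnx value of the row whose ID is item_id."""
    matches = df.loc[df["ID"] == item_id, "columnx"]
    if matches.empty:
        # Assumption: an unknown ID yields an empty frame rather than an error.
        return df.iloc[0:0]
    # Compare against a scalar, not a Series, to avoid the label-alignment error.
    return df[df["columnx"] == matches.iloc[0]]

result = subset(df, 1)
print(result)
```

Without the empty check, .iloc[0] raises an IndexError when no row has the requested ID.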

Do you mean something like this?
You pass the item ID as an argument to the subset function. It looks up the row whose ID column equals itemID, takes that row's columnx value, and returns every row with the same columnx value.
def subset(itemID):
    columnValueRequest = df[df['ID'] == itemID]['columnx'].iloc[0]
    subset = df[df['columnx'] == columnValueRequest]
    return subset

Related

how to check value existing in pandas dataframe column value of type list

I have a pandas dataframe that contains values in the format below. How can I filter the dataframe to the rows where 'd6d4e77e-b8ec-467a-ba06-1c6079aa2d82' appears in any of the list values in the PathDSC column?
I tried:
def get_projects_belongs_to_root_project(project_df, root_project_id):
    filter_project_df = project_df.query("root_project_id in PathDSC")
It didn't work; I got an empty dataframe.

Assuming the values of PathDSC column are lists of strings, you can check row-wise if each list contains the wanted value and mask those rows using Series.apply. Then select only those rows using boolean indexing.
def get_projects_belongs_to_root_project(project_df, root_project_id):
    mask = project_df['PathDSC'].apply(lambda lst: root_project_id in lst)
    filter_project_df = project_df[mask]
    # ...

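If the PathDSC values are stored as plain strings rather than lists, a substring match works (on list values .str.contains yields NaN instead of a boolean):
root_project_id = 'd6d4e77e-b8ec-467a-ba06-1c6079aa2d82'
df = df[df['PathDSC'].str.contains(root_project_id)]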
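A small sketch of the row-wise membership check on made-up data. The sample rows, the name column, and the handling of missing cells are assumptions; the original question does not show what non-list cells look like.

```python
import pandas as pd

# Hypothetical data: each PathDSC cell is a list of project IDs, or missing.
project_df = pd.DataFrame({
    "name": ["p1", "p2", "p3"],
    "PathDSC": [
        ["d6d4e77e-b8ec-467a-ba06-1c6079aa2d82", "aaaa"],
        ["bbbb"],
        None,
    ],
})

def get_projects_belongs_to_root_project(project_df, root_project_id):
    # Treat non-list cells (e.g. None or NaN) as "no match" instead of failing.
    mask = project_df["PathDSC"].apply(
        lambda lst: isinstance(lst, list) and root_project_id in lst
    )
    return project_df[mask]

filtered = get_projects_belongs_to_root_project(
    project_df, "d6d4e77e-b8ec-467a-ba06-1c6079aa2d82"
)
print(filtered["name"].tolist())
```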

Split a column of a dataframe into two separate columns

I'd like to split a column of a dataframe into two separate columns. Here is what my dataframe looks like (only the first 3 rows):
I'd like to split the column referenced_tweets into two columns, type and id, so that, for example, in the first row the value of the type column would be replied_to and the value of id would be 1253050942716551168.
Here is what I've tried:
df[['type', 'id']] = df['referenced_tweets'].str.split(',', n=1, expand=True)
but I get the error:
ValueError: Columns must be the same length as key
(I think I get this error because the type in the referenced_tweets column is NOT always replied_to; it can be retweeted, for example, and therefore the lengths would be different.)

Why not get the values from the dict and add them as two new columns?
def unpack_column(df_series, key):
    """Unpack the given key from the first dict in each cell, skipping missing values."""
    # Missing cells are NaN (a float), so only list cells are unpacked.
    return [value[0][key] if isinstance(value, list) else None for value in df_series]
df['type'] = unpack_column(df['referenced_tweets'], 'type')
df['id'] = unpack_column(df['referenced_tweets'], 'id')
or in a one-liner (this assumes there are no missing values; returning a pd.Series from the lambda makes apply produce two columns):
df[['type', 'id']] = df['referenced_tweets'].apply(lambda x: pd.Series([x[0]['type'], x[0]['id']]))
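The following sketch runs the unpacking idea end to end. The sample rows are invented to mirror the shape described in the question (a list holding one dict, or NaN), and first_ref is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question: a list holding one dict, or NaN.
df = pd.DataFrame({
    "referenced_tweets": [
        [{"type": "replied_to", "id": "1253050942716551168"}],
        np.nan,
        [{"type": "retweeted", "id": "123"}],
    ]
})

def first_ref(value, key):
    # Missing cells are floats (NaN); only non-empty lists carry a dict to unpack.
    return value[0][key] if isinstance(value, list) and value else None

df["type"] = df["referenced_tweets"].apply(lambda v: first_ref(v, "type"))
df["id"] = df["referenced_tweets"].apply(lambda v: first_ref(v, "id"))
print(df[["type", "id"]])
```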

Store Value From df to Variable

I am trying to extract a value out of a dataframe and put it into a variable. Then later I will record that value into an Excel workbook.
First I run a SQL query and store into a df:
df = pd.read_sql(strSQL, conn)
I am looping through another list of items and looking them up in the df. They are connected by MMString in the df and MMConcat from the list of items I'm looping through.
dftemp = df.loc[df['MMString'] == MMConcat]
Category = dftemp['CategoryName'].item()
I get the following error at the last line of code above: ValueError: can only convert an array of size 1 to a Python scalar
In the debug console, when I run that last line of code without storing it to a variable, I get what looks like a string value. For example, 'Pickup Truck'.
How can I simply store the value that I'm looking up in the df to a variable?

Index by row and column with loc to return a series, then extract the first value via iat:
Category = df.loc[df['MMString'] == MMConcat, 'CategoryName'].iat[0]
Alternatively, get the first value from the NumPy array representation:
Category = df.loc[df['MMString'] == MMConcat, 'CategoryName'].values[0]
The docs aren't helpful, but pd.Series.item just calls np.ndarray.item and only works for a series with one value:
pd.Series([1]).item() # 1
pd.Series([1, 2]).item() # ValueError: can only convert an array of size 1
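As a hedged sketch, the lookup can be wrapped so that zero matches do not raise. The sample frame stands in for the SQL result, and lookup_category with its default parameter is a hypothetical helper, not something from the original post.

```python
import pandas as pd

# Hypothetical stand-in for the SQL result; note the duplicate MMString.
df = pd.DataFrame({
    "MMString": ["FordF150", "FordF150", "HondaCivic"],
    "CategoryName": ["Pickup Truck", "Pickup Truck", "Sedan"],
})

def lookup_category(df, mm_concat, default=None):
    matches = df.loc[df["MMString"] == mm_concat, "CategoryName"]
    # .item() would raise for zero or multiple matches; .iat[0] only needs one.
    return matches.iat[0] if not matches.empty else default

print(lookup_category(df, "FordF150"))
print(lookup_category(df, "Unknown"))
```

Taking the first match silently ignores duplicates, so this is only appropriate when all rows for a given MMString are expected to share the same category.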

Selecting Pandas dataframe column

I am trying to use pandas data-frame as a parameter table which is loaded in the beginning of my application run.
Structure of the csv that is being loaded into the data-frame is as below :
param_name,param_value
source_dir,C:\Users\atiwari\Desktop\EDIFACT\source_dir
So the column names would be param_name and param_value.
How do I go about selecting the value from param_value where param_name == 'source_dir'?
I tried the below, but it returns a Series with an index, not a string value:
param_df.loc[param_df['param_name']=='source_dir']['param_value']

This returns a Series:
s = param_df.loc[param_df['param_name']=='source_dir', 'param_value']
If you need a DataFrame instead:
df = param_df.loc[param_df['param_name']=='source_dir', ['param_value']]
To get a scalar, select the first value of the Series, either by position 0 on the underlying values or with iat.
Series.item also works, but it needs a Series with exactly one value and raises an error if the Series is empty:
val = s.values[0]
val = s.iat[0]
val = s.item()
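For a parameter table that is read once and queried many times, one possible alternative is to index the values by parameter name up front. This is a sketch that assumes parameter names are unique; the params variable name is illustrative.

```python
from io import StringIO

import pandas as pd

# The CSV content from the question, inlined so the example is self-contained.
csv_text = r"""param_name,param_value
source_dir,C:\Users\atiwari\Desktop\EDIFACT\source_dir
"""

param_df = pd.read_csv(StringIO(csv_text))

# Series of values keyed by parameter name (assumes names are unique).
params = param_df.set_index("param_name")["param_value"]

source_dir = params["source_dir"]
print(source_dir)
```

With a unique index, params["source_dir"] returns the plain string; a missing name raises a KeyError.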

selecting a specific value from a data frame

I am trying to select a value from a dataframe. But the problem is that the output includes the data type and column name.
Here is my data frame, which I am reading from a CSV file:
Name,Code
blackberry,1
wineberry,2
rasberry,1
blueberry,1
mulberry,2
And here is my test code:
dataFrame=pd.read_csv("test.csv")
value = dataFrame.loc[dataFrame['Name'] == 'rasberry']['Code']
print(value)
strvalue=str(value)
if(strvalue=="1"):
    print("got it")
The expected output of value would be 1, but it is
2 1\nName: Code, dtype: int64
and that's why the if condition is not working. How can I get the specific value?
I am using pandas

The value you get is a Series object. You can use .iloc to extract the value from it:
value.iloc[0]
# 1
Or you can use .values to extract the underlying numpy array and then use index to extract the value:
value.values[0]
# 1
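Putting this together with the question's data, a minimal sketch might look like the following. Reading from an inline string instead of test.csv and comparing against the integer 1 instead of the string "1" are choices made here for illustration.

```python
from io import StringIO

import pandas as pd

txt = """Name,Code
blackberry,1
wineberry,2
rasberry,1
blueberry,1
mulberry,2"""

dataFrame = pd.read_csv(StringIO(txt))

# Select row and column in one .loc call, then take the first element.
value = dataFrame.loc[dataFrame["Name"] == "rasberry", "Code"].iloc[0]

# value is a NumPy integer, so compare it with the integer 1 directly.
if value == 1:
    print("got it")
```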

Break It Down
1. dataFrame['Name'] returns a pd.Series
2. dataFrame['Name'] == 'rasberry' returns a pd.Series with dtype bool
3. dataFrame.loc[dataFrame['Name'] == 'rasberry'] uses the boolean pd.Series to slice dataFrame, returning a pd.DataFrame that is a subset of dataFrame
4. dataFrame.loc[dataFrame['Name'] == 'rasberry']['Code'] is a pd.Series that is the column named 'Code' in the sliced dataframe from step 3.
If you expect the elements in the 'Name' column to be unique, then this will be a one-row pd.Series.
You want the element inside, but at this point it's the difference between 'value' and ['value'].
Setup
from io import StringIO
import pandas as pd
txt = """Name,Code
blackberry,1
wineberry,2
rasberry,1
blueberry,1
mulberry,2"""
Solution(s)
use iloc to grab first value
dataFrame=pd.read_csv(StringIO(txt))
value = dataFrame.query('Name == "rasberry"').Code.iloc[0]
print(value)
use iat to grab first value
dataFrame=pd.read_csv(StringIO(txt))
value = dataFrame.query('Name == "rasberry"').Code.iat[0]
print(value)
specify index column when reading in csv and use loc
dataFrame=pd.read_csv(StringIO(txt), index_col='Name')
value = dataFrame.loc['rasberry', 'Code']
print(value)
specify index column when reading in csv and use at
dataFrame=pd.read_csv(StringIO(txt), index_col='Name')
value = dataFrame.at['rasberry', 'Code']
print(value)
specify index column when reading in csv and use get_value (deprecated in pandas 0.21 and removed in 1.0; on current versions use at)
dataFrame=pd.read_csv(StringIO(txt), index_col='Name')
value = dataFrame.get_value('rasberry', 'Code')
print(value)
specify the index column when reading the csv and squeeze into a series if only one non index column exists (the squeeze argument of read_csv was deprecated in pandas 1.4 and removed in 2.0)
series=pd.read_csv(StringIO(txt), index_col='Name', squeeze=True)
value = series.rasberry
print(value)
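On recent pandas versions, where DataFrame.get_value and the squeeze argument of read_csv are no longer available, the last two variants can be written roughly as below. This is a sketch using DataFrame.at and DataFrame.squeeze; the variable names value_at and value_series are illustrative.

```python
from io import StringIO

import pandas as pd

txt = """Name,Code
blackberry,1
wineberry,2
rasberry,1
blueberry,1
mulberry,2"""

dataFrame = pd.read_csv(StringIO(txt), index_col="Name")

# Label-based scalar access; takes the place of get_value.
value_at = dataFrame.at["rasberry", "Code"]

# Collapse the single remaining column into a Series after reading.
series = dataFrame.squeeze("columns")
value_series = series["rasberry"]

print(value_at, value_series)
```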
