I need to add an integer-represented column in a pandas dataframe. For example, if I have a dataframe with names and genders as the following:
I would need to add a new column with an integer value depending on the gender. Expected output would be as follows:
df['Gender_code'] = df['Gender'].transform(lambda gender: 1 if gender == 'Female' else 0)
Explanation: Using transform(), you can apply a function to all values of any column. Here, I applied the lambda function to the column 'Gender'.
For just two genders you can use a comparison:
df['Gender_code'] = df['Gender'].eq('Female').astype(int)
In the general case, you can resort to factorize:
df['Gender_code'] = df['Gender'].factorize()[0]
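A quick sketch of both approaches on made-up data (the question's original dataframe is not shown, so the names here are illustrative):
import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cara'],
                   'Gender': ['Female', 'Male', 'Female']})

# comparison-based encoding: Female -> 1, everything else -> 0
df['Gender_code'] = df['Gender'].eq('Female').astype(int)
print(df['Gender_code'].tolist())  # [1, 0, 1]

# factorize() numbers categories in order of first appearance,
# so here 'Female' gets 0 and 'Male' gets 1
print(df['Gender'].factorize()[0].tolist())  # [0, 1, 0]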
I have a DataFrame which has a few columns. There is a column with a value that only appears once in the entire dataframe. I want to write a function that returns the column name of the column with that specific value. I can manually find which column it is with the usual data exploration, but since I have multiple dataframes with the same properties, I need to be able to find that column for multiple dataframes. So a somewhat generalized function would be of better use.
The problem is that I don't know beforehand which column is the one I am looking for since in every dataframe the position of that particular column with that particular value is different. Also the desired columns in different dataframes have different names, so I cannot use something like df['my_column'] to extract the column.
Thanks
You'll need to iterate columns and look for the value:
def find_col_with_value(df, value):
    for col in df:
        if (df[col] == value).any():
            return col
This will return the name of the first column that contains value. If value does not exist, it will return None.
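For example, on a small made-up frame (the 'needle' value and column names are illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 'needle']})

print(find_col_with_value(df, 'needle'))  # b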
Check the entire DataFrame for the specific value, using any() to see if it ever appears in a column, then slice the columns (or the DataFrame if you want the Series):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(0, 5, (100, 200)),
                  columns=[chr(i+40) for i in range(200)])
df.loc[5, 'Y'] = 'secret_value' # Secret value in column 'Y'
df.eq('secret_value').any().loc[lambda x: x].index
# or
df.columns[df.eq('secret_value').any()]
Index(['Y'], dtype='object')
I have another solution:
names = ds.columns
for i in names:
    for j in ds[i]:
        if j == 'your_value':
            print(i)
            break
Here you collect all the column names, then iterate over the whole dataset until the value is found and print the name of the matching column. Note that break only exits the inner loop, so each matching column is printed once before moving on to the next column.
I have a pandas dataframe that looks like this:
I would like to generate counts of instances of 'x' (regardless of whether they're unique or not) per 'id'. The result would be inserted as a column labeled 'x_count' as shown below:
Any tips would be helpful.
Simply a groupby with transform('count'):
df['x_count'] = df.groupby('id')['x'].transform('count')
If you also want to count the NaN values, use 'size':
df['x_count'] = df.groupby('id')['x'].transform('size')
Try .value_counts with .map:
df['x_count'] = df['id'].map(df['id'].value_counts())
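A quick sketch on made-up data showing the difference between 'count' and 'size':
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'x': ['a', 'b', np.nan, 'c', 'c']})

# 'count' skips NaN, 'size' includes it
print(df.groupby('id')['x'].transform('count').tolist())  # [2, 2, 2, 2, 2]
print(df.groupby('id')['x'].transform('size').tolist())   # [3, 3, 3, 2, 2]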
I want to compare two different columns in a dataframe (called station_programming_df). One column contains integers (called 'facility_id'). A second column (called 'final_participants_val') contains a comma-separated string of integers. I want to see if the integer in the first column appears in the second column. If true, I want to return a "1" in a new column (i.e., 'master_color').
I have tried various approaches, including python's "isin" function, which is not returning errors but is also not returning the correct matches. I have also attempted to convert the datatypes, but with no luck.
station_programming_df['master_color'] = np.where(station_programming_df['facility_id'].isin(station_programming_df['final_participants_val']), 1, 0)
Here is what the dataframe that I am using looks like:
DATA:
facility_id,final_participants_val,master_color
35862,"62469,33894,33749,34847,21656,35396,4624,69571",0
35396,"62469,33894,33749,34847,21656,35396,4624,69571",0
While no error message is returned, I am not finding any matches. The second row should have returned a "1" in the master_color column.
I am wondering if it has to do with how it is interpreting the series (final_participants_val).
Any help would be really appreciated.
Use DataFrame.apply:
station_programming_df['master_color'] = station_programming_df.apply(lambda x: 1 if str(x['facility_id']) in x['final_participants_val'] else 0, axis=1)
print(station_programming_df)
facility_id final_participants_val master_color
0 35862 62469,33894,33749,34847,21656,35396,4624,69571 0
1 35396 62469,33894,33749,34847,21656,35396,4624,69571 1
You can use df.apply and the in operator (this yields booleans; add .astype(int) if you need 1/0).
station_programming_df['master_color'] = station_programming_df.apply(lambda x: str(x.facility_id) in x.final_participants_val, axis=1)
facility_id final_participants_val master_color
0 35862 62469,33894,33749,34847,21656,35396,4624,69571 False
1 35396 62469,33894,33749,34847,21656,35396,4624,69571 True
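Note that a plain substring test can over-match: for example, a facility_id of 3389 would also match inside "33894". A sketch of an exact check, assuming final_participants_val is always a comma-separated string:
station_programming_df['master_color'] = station_programming_df.apply(
    lambda x: int(str(x['facility_id']) in x['final_participants_val'].split(',')),
    axis=1)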
I am facing a problem adding data to a subcolumn in a specific format. I have created "Polypoints" as the main column and I want
df["Polypoints"] = [{"__type":"Polygon","coordinates":Row_list}]
where Row_list is the column of dataframe which contains the data in the below format
df["Row_list"] = [[x1,y1],[x2,y2],[x3,y3]
[x1,y1],[x2,y2],[x3,y3]
[x1,y1],[x2,y2],[x3,y3]]
I want to convert the dataframe into json in the format
"Polypoints" :{"__type":"Polygon" ,"coordinates":Row_list}
There are various ways to do that.
One can create a function create_polygon that takes as input the dataframe (df) and the column name (columnname). That would look like the following:
def create_polygon(df, columnname):
    return {"__type": "Polygon", "coordinates": df[columnname]}
Considering that the column name will be Row_list, the following will already be enough
def create_polygon(df):
    return {"__type": "Polygon", "coordinates": df['Row_list']}
Then with pandas.DataFrame.apply one can apply it row-wise to fill the column Polypoints as follows (for the two-argument version above, pass the column name via args=('Row_list',)):
df['Polypoints'] = df.apply(create_polygon, axis=1)
As Itamar Mushkin mentions, one can also do it with a lambda function as follows:
df['Polypoints'] = df.apply(lambda row: {"__type":"Polygon", "coordinates":row['Row_list']} ,axis=1)
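Putting it together, a minimal sketch with made-up coordinates (the x1, y1, ... in the question are placeholders):
import json
import pandas as pd

df = pd.DataFrame({'Row_list': [[[1, 2], [3, 4], [5, 6]],
                                [[7, 8], [9, 10], [11, 12]]]})

# build the nested dict per row
df['Polypoints'] = df.apply(
    lambda row: {"__type": "Polygon", "coordinates": row['Row_list']}, axis=1)

# serialize the column to JSON
print(json.dumps(df['Polypoints'].tolist()))
# [{"__type": "Polygon", "coordinates": [[1, 2], [3, 4], [5, 6]]}, {"__type": "Polygon", "coordinates": [[7, 8], [9, 10], [11, 12]]}]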
I am looking for a way to filter a df on a value in a column, both in a groupby and, in another instance, when calling a single column of that df.
For example:
So how would I plot this df's column_betas as below, but only when a different column (called column_value) has a value of 2?
df['column_betas'] # ( when a different column called `column_value` is 2)
and for the below, when I am running a groupby on the City column, but only when column_value = 2?
df.groupby(['City']).quantile(.5)
I am trying to avoid creating additional dfs that filter for a certain value of column_value, and instead apply that filter directly when selecting the specific column or doing the groupby.
This command gets df['column_betas'], where column_value is 2:
df[df['column_value'] == 2]['column_betas']
and this command does the groupby only on rows that have a value of 2 in column_value:
df[df['column_value'] == 2].groupby(['City'])
Substitute df with
df[df['column_value']==2]
So df['column_betas'] becomes df[df['column_value']==2]['column_betas']
and df.groupby(['City']).quantile(.5) becomes df[df['column_value']==2].groupby(['City']).quantile(.5)
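A quick sketch on made-up data:
import pandas as pd

df = pd.DataFrame({'City': ['NY', 'NY', 'LA', 'LA'],
                   'column_value': [2, 1, 2, 2],
                   'column_betas': [0.5, 0.9, 0.7, 0.3]})

# column_betas restricted to rows where column_value is 2
print(df[df['column_value'] == 2]['column_betas'])

# median per City over the same restriction
print(df[df['column_value'] == 2].groupby(['City'])['column_betas'].quantile(.5))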