I am working on a basic script. My aim is to use gender_guesser.detector to find the gender for rows in a dataframe where the imported CSV file is missing those values. For simplicity, I just created a dummy dataframe in the code below.
I am pretty new to Python and very much in the learning phase, so I assume there are more elegant solutions to what I am trying to do. My idea was to add a new column, find the value for each row using the above-mentioned function, and then fill in the NaN values while keeping the original gender values where applicable (dropping the temporary column once it's done).
The d.get_gender part works if I manually apply it to a specific row, and Jupyter accepts the function definition as well.
import pandas as pd
import gender_guesser.detector as gender

df = pd.DataFrame([['Adam', 'Smith', ''],
                   ['Lindsay', 'Jackson', 'M'],
                   ['Laura', 'Jones', 'F'],
                   ['Arthur', 'Jackson', '']],
                  columns=['first_name', 'last_name', 'gender'])
df['newgender'] = ""

def findgender(dataframe):
    for row in dataframe:
        d = gender.Detector()
        df.loc[row, 'newgender'] = d.get_gender(df.loc[row, 'first_name'])
    return df

df.apply(findgender, axis=1)
When I then try to apply this to my dataframe, I get a lengthy error message, the last line being
KeyError: ('Adam', 'occurred at index 0')
I tried to look up similar posts here, but for most of them adding axis=1 solved the issue. As I already have it, I am at a loss as to why the code is not working.
Any help or explanation of why the issue is occurring would be extremely helpful.
With axis=1, apply passes each row into findgender as a Series, so for row in dataframe iterates over that row's values ('Adam', 'Smith', ...); using those values as .loc labels is what raises the KeyError. In general, it is better to avoid accessing a dataframe row by row. The following solution works using a lambda function.
import pandas as pd
import gender_guesser.detector as gender
df = pd.DataFrame([['Adam', 'Smith', ''],
                   ['Lindsay', 'Jackson', 'M'],
                   ['Laura', 'Jones', 'F'],
                   ['Arthur', 'Jackson', '']],
                  columns=['first_name', 'last_name', 'gender'])
df['newgender'] = df['first_name'].apply(lambda x: gender.Detector().get_gender(x))
It produces the following result.
first_name last_name gender newgender
0 Adam Smith male
1 Lindsay Jackson M mostly_female
2 Laura Jones F female
3 Arthur Jackson male
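If the goal from the original question is to fill only the missing genders while keeping the existing M/F values, the lookup can be combined with fillna. A minimal sketch below uses a plain dictionary as a stand-in for gender_guesser, so it runs without the library installed; in real code you would build gender.Detector() once and call d.get_gender(name), and the mapping values here are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame([['Adam', 'Smith', ''],
                   ['Lindsay', 'Jackson', 'M'],
                   ['Laura', 'Jones', 'F'],
                   ['Arthur', 'Jackson', '']],
                  columns=['first_name', 'last_name', 'gender'])

# Stand-in for gender_guesser; replace with d = gender.Detector() built once
# outside the function and d.get_gender(name) in real code.
def guess(name):
    return {'Adam': 'male', 'Lindsay': 'mostly_female',
            'Laura': 'female', 'Arthur': 'male'}.get(name, 'unknown')

# Treat empty strings as missing, then fill only the gaps,
# keeping the original M/F values where they already exist.
df['gender'] = df['gender'].replace('', pd.NA).fillna(df['first_name'].map(guess))
print(df['gender'].tolist())  # ['male', 'M', 'F', 'male']
```

This avoids both the row-by-row loop and the temporary column, since fillna only touches the missing entries.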
I am still learning to code and would be grateful for any help I can get.
I have a dataframe where a person's name is the column header. I would like to shift the column header down one row and rename the column 'Names'. The column header will be different with each dataframe, though; it won't always be the same person's name.
Here is an example of one of the dataframes:
index Patrick
0 Stan
1 Frank
2 Emily
3 Tami
I am hoping to shift the entire column of names down one row and rename the column header 'Names', but I have been unable to find a way to do this in my research.
Here is an example of the desired output:
index Names
0 Patrick
1 Stan
2 Frank
3 Emily
4 Tami
I have seen the option to use 'shift'; however, I do not know if it will work correctly in this case.
I will post the URL below for a similar problem that was asked. In that problem they want to shift up by one; in my problem I want to shift down by one.
Shift column in pandas dataframe up by one?
The code example I found online is below:
df.gdp = df.gdp.shift(-1)
The issue I see is that I will be using multiple dataframes, so the person's name (the column header) will be different each time.
Any help? I will continue to work on this and will post the answer if I am able to find it. Thanks in advance for any help that you may offer.
Maybe this helps:
df = pd.DataFrame({'index': range(4), 'Patrick': ['Stan', 'Frank', 'Emily', 'Tami']})
names = pd.concat([pd.Series(df.columns[1]),df.iloc[:, 1]]).reset_index(drop=True)
df = pd.DataFrame({'Names': names})
df
     Names
0  Patrick
1     Stan
2    Frank
3    Emily
4     Tami
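An equivalent sketch that generalizes to any header name, since df.columns[0] picks up whatever the person's name happens to be; the single-column toy frame here (without the separate index column) is an assumption for illustration:

```python
import pandas as pd

# Toy frame whose header is a person's name that varies between dataframes.
df = pd.DataFrame({'Patrick': ['Stan', 'Frank', 'Emily', 'Tami']})

name = df.columns[0]
# Prepend the old header as the first data row, then rename the column.
out = pd.concat([pd.DataFrame({name: [name]}), df], ignore_index=True)
out = out.rename(columns={name: 'Names'})
print(out['Names'].tolist())  # ['Patrick', 'Stan', 'Frank', 'Emily', 'Tami']
```

Because nothing refers to the literal name 'Patrick', the same two lines work unchanged for every dataframe.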
I'm working on the titanic data right now, using pandas. The funny thing is, when dealing with missing values, dropna() does not seem to work but notna() does.
temp.Embarked.dropna(inplace = True)
temp.isnull().sum()
Embarked 2
temp = temp[temp['Embarked'].notna()]
temp.isnull().sum()
Embarked 0
The two are not doing the same thing here. temp.Embarked.dropna(inplace=True) drops the values from the intermediate Series temp.Embarked, not from the DataFrame temp, so temp still contains the rows with missing values (pandas typically emits a SettingWithCopyWarning for this kind of chained operation). temp[temp['Embarked'].notna()] builds a boolean mask and reassigns temp to the filtered frame, which is why it works.
To drop the rows directly from the DataFrame, pass the column as a subset:
temp.dropna(subset=['Embarked'], inplace=True)
For further clarification on dropna's parameters, please refer to the link below:
What does axis in pandas mean?
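A small reproduction of the difference, with a toy frame standing in for the titanic data:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the titanic data, with two missing Embarked values.
temp = pd.DataFrame({'Embarked': ['S', np.nan, 'C', np.nan],
                     'Fare': [7.25, 8.05, 71.28, 8.46]})

# dropna on the Series returns a new, shorter Series; temp itself is untouched.
temp['Embarked'].dropna()
assert temp['Embarked'].isnull().sum() == 2

# dropna on the DataFrame with subset= actually removes the rows.
temp = temp.dropna(subset=['Embarked'])
assert temp['Embarked'].isnull().sum() == 0
```

After the subset call, only the two rows with a known Embarked value remain.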
Given a Pandas dataframe such as:
Name Age
John 20
Mary 65
Bob 55
I wish to iterate over the rows, decide whether each person is a senior (age>=60) or not, create a new entry with an extra column, then append that to a csv file such that it (the csv file) reads as follows:
Name Age Senior
John 20 False
Mary 65 True
Bob 55 False
Other than saving the data to a CSV, I am able to do the rest by turning the series the loop is currently iterating over into a dictionary and then adding a new key.
for idx, e in records.iterrows():
    entry = e.to_dict()
    entry["Senior"] = entry["Age"] >= 60
Simply converting the dict to a series and then to a dataframe isn't writing it to the CSV file properly. Is there a pandas or non-pandas way of making this work?
IMPORTANT EDIT: The above is a simplified example; I am dealing with hundreds of rows, and the data I want to add is a long string created at run time, so looping is mandatory. Also, adding it to the original dataframe isn't an option, as I am pretty sure I'll run out of memory at some point (so I can't add the data to the original dataframe, nor create a new dataframe with all the information). I don't want to add the data to the original dataframe, only to a copy of a "row" that will then be appended to a csv.
The example is given to provide some context for my question, but the main focus should be on the question, not the example.
Loops here are not necessary; just assign the new column by comparing with a scalar. To avoid creating the column in the original DataFrame, use DataFrame.assign - it returns a new DataFrame with the new column while the original is unchanged:
df1 = df.assign(senior = df["Age"] >= 60)
EDIT:
If you really need loops (not recommended):
for idx, e in df.iterrows():
    df.loc[idx, "senior"] = e["Age"] >= 60
print (df)
Name Age senior
0 John 20 False
1 Mary 65 True
2 Bob 55 False
Also you can use ge:
df2 = df.copy()
df2['senior'] = df2['Age'].ge(60)
And now:
print(df2)
Output:
Name Age senior
0 John 20 False
1 Mary 65 True
2 Bob 55 False
use np.where:
import numpy as np
df1 = df.copy()
df1['Senior'] = np.where(df1['Age'] >= 60, True, False)
Found the answer I needed here: Convert a dictionary to a pandas dataframe
Code:
first_entry = True
for idx, e in records.iterrows():
    entry = e.to_dict()
    entry["Senior"] = entry["Age"] >= 60
    df_entry = pd.DataFrame([entry], columns=entry.keys())
    df_entry.to_csv(output_path, sep=',', index=False, columns=header,
                    header=first_entry, mode='a')
    # output_path is a variable with the path to the csv;
    # header is a variable with the list of the new column names
    first_entry = False
Was hoping for a better way to do it, but this one works fine.
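If memory is the main constraint, the same append-to-csv pattern also works chunk-wise rather than row-by-row, which keeps the vectorised comparison without ever holding the full result. A sketch, with a small in-memory frame and a temp-file path standing in for the real records and output_path (in practice records would be read with pd.read_csv(..., chunksize=N) so the source never sits in memory either):

```python
import pandas as pd
import os
import tempfile

# Toy stand-ins for the real inputs.
records = pd.DataFrame({'Name': ['John', 'Mary', 'Bob'],
                        'Age': [20, 65, 55]})
output_path = os.path.join(tempfile.gettempdir(), 'seniors_demo.csv')
if os.path.exists(output_path):
    os.remove(output_path)

chunk_size = 2  # toy size; real chunks would be thousands of rows
for start in range(0, len(records), chunk_size):
    chunk = records.iloc[start:start + chunk_size].copy()
    chunk['Senior'] = chunk['Age'] >= 60        # vectorised within the chunk
    # Append mode; write the header only once, on the first chunk.
    chunk.to_csv(output_path, mode='a', index=False, header=(start == 0))

print(pd.read_csv(output_path)['Senior'].tolist())  # [False, True, False]
```

Each chunk is discarded after it is written, so peak memory stays proportional to the chunk size rather than the file size.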
Please excuse me if this question is too n00bish; I am brand new to Python and need to use it for work, which unfortunately means diving into higher-level stuff without first understanding the basics...
I have a massive CSV of text transcripts which I read into a pandas dataframe. These transcripts are broken down by ID, and the IDs must be grouped to get a single record for each interaction, as the conversations are split into segments in the original database they come from. The format is something like this:
ID TEXT
1 This is the beginning of a convo
1 heres the middle
1 heres the end of the convo
2 this is the start of another convo...etc.
I used this code to group by ID and create single records:
df1 = df.groupby('ID').TEXT.apply(' '.join)
This code worked great, but now I am stuck with a series (?) that no longer recognizes 'ID' as a column; I think it has been merged into the result somehow. When I use to_frame() the problem remains. How might I separate the ID again and use it to index the data?
groupby returns the grouped-by column as the index. Looking at your code, this is what I see.
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 2],
                   'TEXT': ['This is the beginning of a convo',
                            'heres the middle',
                            'heres the end of the convo',
                            'this is the start of another convo...etc.']})
df1 = df.groupby('ID').TEXT.apply(' '.join)
print(df1)
ID
1 This is the beginning of a convo heres the mid...
2 this is the start of another convo...etc.
Name: TEXT, dtype: object
You can take the series df1 and reset its index if you want ID as a column in a dataframe, or keep it as the index of the series, which can be handy depending on what your next steps will be.
df1 = df1.reset_index()
print(df1)
ID TEXT
0 1 This is the beginning of a convo heres the mid...
1 2 this is the start of another convo...etc.
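As a side note, passing as_index=False to groupby avoids the reset_index step altogether; a small sketch with the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2],
                   'TEXT': ['This is the beginning of a convo',
                            'heres the middle',
                            'heres the end of the convo',
                            'this is the start of another convo...etc.']})

# as_index=False keeps ID as an ordinary column in the result,
# so no reset_index is needed afterwards.
df1 = df.groupby('ID', as_index=False).agg({'TEXT': ' '.join})
print(df1['ID'].tolist())  # [1, 2]
```

The result is a DataFrame with ID and TEXT columns, ready for further merging or indexing.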
I am trying some DataFrame manipulation in pandas that I have learnt. The dataset I am playing with is from the EY Data Science Challenge.
This first part may be irrelevant, but just for context: I have gone through and set some indexes:
import pandas as pd
import numpy as np
# loading the main dataset
df_main = pd.read_csv(filename)
'''Sorting Indexes'''
# getting rid of the id column
del df_main['id']
# sorting values by the LOCATION and TIME columns
# setting the index to LOCATION (1st tier) then TIME (2nd tier), then re-sorting
df_main = df_main.sort_values(['LOCATION', 'TIME'])
df_main = df_main.set_index(['LOCATION', 'TIME']).sort_index()
The problem I have is with the missing values. I have decided that columns 7 ~ 18 can be interpolated, because a lot of the data is very consistent year by year.
So I made a simple function that takes a list of column names and applies interpolation to each column, within each LOCATION group.
'''Missing Values'''
x = df_main.groupby("LOCATION")

def interpolate_columns(list_of_column_names):
    for column in list_of_column_names:
        df_main[column] = x[column].apply(lambda s: s.interpolate(method='linear'))

interpolate_columns(list(df_main.columns[7:18]))
However, one of the columns (Access to electricity (% of urban population with access) [1.3_ACCESS.ELECTRICITY.URBAN]) does not seem to interpolate, while all the other columns interpolate successfully.
No errors are thrown when I run the function, and it is not trying to interpolate backwards either.
Any ideas regarding why this problem is occurring?
EDIT: I should also mention that the column in question was missing the same number of values - and in the same rows - as many of the other columns that interpolated successfully.
After looking at the data more closely, it seems interpolate was not working on some columns because data was missing in the first rows of a group in the groupby object: linear interpolation only fills values between known points, so those leading NaNs are left untouched.
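That diagnosis can be reproduced on a toy group: linear interpolation leaves a leading NaN in place, and chaining a backfill is one way to cover it (whether backfilling is appropriate for the electricity data is an assumption to check):

```python
import pandas as pd
import numpy as np

# Toy version of one LOCATION group: the first value is missing, so linear
# interpolation (which only fills between known points) leaves it NaN.
df = pd.DataFrame({'LOCATION': ['AUS'] * 4,
                   'ELEC': [np.nan, 50.0, np.nan, 70.0]})

interp = df.groupby('LOCATION')['ELEC'].apply(
    lambda s: s.interpolate(method='linear'))
assert np.isnan(interp.iloc[0])  # leading NaN survives interpolation

# Chaining a backfill covers the leading gap.
filled = df.groupby('LOCATION')['ELEC'].apply(
    lambda s: s.interpolate(method='linear').bfill())
print(filled.tolist())  # [50.0, 50.0, 60.0, 70.0]
```

The interior NaN is interpolated to 60.0 either way; only the leading value differs between the two runs.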