Find an occurrence in a table column using pandas - python

I'm a complete beginner with Python and I'm trying to find out how many females and males I have in my table.
I've imported my table and printed it, but I don't know how to count occurrences in my columns.
tableur = gc.open("KIDS91").sheet1
print(tableur)
CITY CSP BIRTHDAY NAME GENDER
LOND 48 01/04/2009 Peter M
LOND 20 06/11/2008 Lucy F
LOND 22 23/06/2009 Piper F

See pandas value_counts (this assumes tableur is a pandas DataFrame):
tableur.GENDER.value_counts()

Perhaps this will help:
import pandas as pd
df = pd.read_csv("KIDS91", sep ='\t')
df.GENDER.value_counts()
A brief explanation:
First line: imports pandas.
Second line: loads your data into a DataFrame in memory.
Last line: returns the count of each value in the GENDER column of the DataFrame named df.
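For anyone who wants to try value_counts without the spreadsheet, here is a minimal sketch. The rows are copied from the sample in the question; building the DataFrame inline is an assumption made only so the example runs on its own.

```python
import pandas as pd

# Sample rows copied from the question; built inline so the example is self-contained.
df = pd.DataFrame(
    {
        "CITY": ["LOND", "LOND", "LOND"],
        "CSP": [48, 20, 22],
        "BIRTHDAY": ["01/04/2009", "06/11/2008", "23/06/2009"],
        "NAME": ["Peter", "Lucy", "Piper"],
        "GENDER": ["M", "F", "F"],
    }
)

# value_counts returns a Series: the index holds the distinct values,
# the values hold the number of rows for each of them.
counts = df["GENDER"].value_counts()
print(counts)

# Individual counts can be looked up by label.
n_female = counts["F"]
n_male = counts["M"]
```

With the three sample rows this gives 2 for F and 1 for M.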

Related

KeyError at Index 0 when applying function to dataframe

I am working on a basic code - my aim is to use gender_guesser.detector to find the gender for rows in a dataframe where the imported file (from CSV) would be missing those values. For simplicity, I just created a dummy dataframe in the code below.
I am pretty new to Python and very much in the learning phase so I assume that there are definitely more elegant solutions to what I am trying to do. My idea was to add a new column, find the values for each row using the above mentioned function and then fill out the NaN values while keeping the original gender values where applicable (dropping the temp column once it's done).
The d.get_gender part works if I manually apply it to a specific row and Jupyter accepts the function as well.
df = pd.DataFrame([['Adam','Smith',''],['Lindsay','Jackson','M'],['Laura','Jones','F'],['Arthur','Jackson','']] ,columns=['first_name','last_name','gender'])
import gender_guesser.detector as gender
df['newgender']=""
def findgender(dataframe):
    for row in dataframe:
        d = gender.Detector()
        df.loc[row, 'newgender'] = d.get_gender(df.loc[row,'first_name'])
    return df
df.apply(findgender, axis=1)
When I then try to apply this to my dataframe, I get a lengthy error message, the last line being
KeyError: ('Adam', 'occurred at index 0')
I tried to look up similar posts here but for most, adding axis=1 solved the issue - as I already have it, I am clueless why the code is not working.
Any help or explanation on why the issue is occurring would be extremely helpful.
I'm not sure why you are getting that error. Usually, it is better to avoid accessing a dataframe line by line. The following solution seems to work using a lambda function.
import pandas as pd
import gender_guesser.detector as gender
df = pd.DataFrame([['Adam','Smith',''],['Lindsay','Jackson','M'],['Laura','Jones','F'],['Arthur','Jackson','']] ,columns=['first_name','last_name','gender'])
d = gender.Detector()  # create the detector once instead of once per row
df['newgender'] = df['first_name'].apply(lambda x: d.get_gender(x))
It produces the following result.
  first_name last_name gender      newgender
0       Adam     Smith                  male
1    Lindsay   Jackson      M  mostly_female
2      Laura     Jones      F         female
3     Arthur   Jackson                  male
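A likely explanation for the KeyError in the question: with axis=1, apply passes each row to the function as a Series, and a for loop over a Series yields its values. The first `row` is therefore the string 'Adam', which is not an index label of df, so df.loc['Adam', ...] fails. The sketch below shows this and then fills in only the missing genders. `guess_gender` is a hypothetical stand-in for `gender.Detector().get_gender`, used so the example runs without gender_guesser installed, and the empty string is treated as "missing" because that is what the dummy data uses.

```python
import pandas as pd

df = pd.DataFrame(
    [["Adam", "Smith", ""], ["Lindsay", "Jackson", "M"],
     ["Laura", "Jones", "F"], ["Arthur", "Jackson", ""]],
    columns=["first_name", "last_name", "gender"],
)

# Iterating over one row (a Series) yields the cell values, not index labels.
first_row_items = list(df.iloc[0])
print(first_row_items)

def guess_gender(name):
    # Hypothetical stand-in for gender.Detector().get_gender(name).
    lookup = {"Adam": "male", "Arthur": "male", "Laura": "female"}
    return lookup.get(name, "unknown")

# Guess a gender for every first name, then keep the original value
# wherever one was already present ('' marks a missing value here).
guessed = df["first_name"].apply(guess_gender)
df["gender"] = df["gender"].where(df["gender"] != "", guessed)
print(df)
```

Series.where keeps the existing value where the condition is True and takes the value from `guessed` elsewhere, so no temporary column has to be added and dropped.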

How to loop over a dataframe, add new fields to a series, then append that series to a csv?

Given a Pandas dataframe such as:
Name  Age
John   20
Mary   65
Bob    55
I wish to iterate over the rows, decide whether each person is a senior (age>=60) or not, create a new entry with an extra column, then append that to a csv file such that it (the csv file) reads as follows:
Name  Age  Senior
John   20  False
Mary   65  True
Bob    55  False
Other than saving the data to a csv, I am able to do the rest by turning the series the loop is currently iterating over to a dictionary then adding a new key.
for idx, e in records.iterrows():
    entry = e.to_dict()
    entry["senior"] = (entry["Age"]>=60)
Simply converting the dict to a series and then to a dataframe isn't writing it to the csv file properly. Is there a pandas or non-pandas way of making this work?
IMPORTANT EDIT: The above is a simplified example. I am dealing with hundreds of rows, and the data I want to add is a long string that will be created at run time, so looping is mandatory. Adding it to the original dataframe isn't an option either, as I am pretty sure I'll run out of memory at some point (so I can't add the data to the original dataframe or create a new dataframe with all the information). I only want to add the data to a copy of a "row" that will then be appended to a csv.
The example is given to provide some context for my question, but the main focus should be on the question, not the example.
A loop is not necessary here: create the new column by comparing with a scalar. To avoid creating the column in the original DataFrame, use DataFrame.assign, which returns a new DataFrame with the new column and leaves the original unchanged:
df1 = df.assign(senior = df["Age"]>=60)
EDIT:
If you really need a loop (not recommended):
for idx, e in df.iterrows():
    df.loc[idx, "senior"] = e["Age"]>=60
print (df)
   Name  Age senior
0  John   20  False
1  Mary   65   True
2   Bob   55  False
Also you can use ge:
df2 = df.copy()
df2['senior'] = df2['Age'].ge(60)
And now:
print(df2)
Output:
   Name  Age  senior
0  John   20   False
1  Mary   65    True
2   Bob   55   False
Use np.where:
import numpy as np
df1 = df.copy()
df1['Senior'] = np.where(df1['Age']>=60,True,False)
Found the answer I needed here: Convert a dictionary to a pandas dataframe
Code:
first_entry=True
for idx, e in records.iterrows():
    entry = e.to_dict()
    entry["senior"] = (entry["Age"]>=60)
    df_entry = pd.DataFrame([entry], columns=entry.keys())
    # output_path is a variable with the path to the csv; header is a variable with the list of new column names
    df_entry.to_csv(output_path, sep=',', index=False, columns=header, header=first_entry, mode='a')
    first_entry=False
Was hoping for a better way to do it, but this one works fine.
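One possible alternative to building a one-row DataFrame per iteration is to write each row with the standard library's csv.DictWriter. This is only a sketch, not the poster's method: `records`, `output_path`, and the `senior` column name are assumptions taken from the question, and the temporary directory is there just to keep the example self-contained.

```python
import csv
import os
import tempfile

import pandas as pd

records = pd.DataFrame({"Name": ["John", "Mary", "Bob"], "Age": [20, 65, 55]})

# Assumed output location; a temporary directory keeps the example self-contained.
output_path = os.path.join(tempfile.mkdtemp(), "seniors.csv")
fieldnames = list(records.columns) + ["senior"]

with open(output_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()  # the header is written once, before the loop
    for _, row in records.iterrows():
        entry = row.to_dict()
        entry["senior"] = entry["Age"] >= 60  # stands in for the value built at run time
        writer.writerow(entry)  # each row goes straight to the file

# Read the file back to check what was written.
result = pd.read_csv(output_path)
print(result)
```

Only one row is held as a dict at any time, and the original DataFrame is never modified. Opening the file with mode "a" instead of "w" (and skipping writeheader) would append to an existing file.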

Print and write values in a file greater and lesser than 10 and make a plot of it by using Python

I have an enormous data set in .csv format with many columns; the ones I am interested in are columns 3 and 7. I want to print both columns.
Sample Data: {Only Col 3 and 7 are displayed}
Names Numbers
John 12
Kim 5
Alex 16
mike 2
giki 8
David 18
Desired Output #values greater than 10:
John 12
Alex 16
David 18
Desired Output #values lesser than 10:
Kim 5
mike 2
giki 8
Rhea
I'm not sure I understand what you are trying to accomplish, so I'll try to help by going through some basics:
a) Do you already have your data in a DataFrame? Or is it in some form of tabular data such as a csv or Excel file?
DataFrame = two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Either way, you will have to import pandas to read or manipulate this file. Then you can turn it into a DataFrame using one of pandas' reading functions, such as pandas.read_csv or pandas.read_excel.
import pandas as pd
# if your data is in a dictionary
df = pd.DataFrame(data=d)
# csv
df = pd.read_csv('file name and path')
b) Then you can slice through it using pandas, and create new DataFrames
output1 = df.loc[df['Numbers'] > 10]
output2 = df.loc[df['Numbers'] < 10]
c) The most basic way to plot is using the pandas method plot on your new DataFrame (you can get a lot fancier than that using matplotlib or seaborn). Although you should probably think about what kind of information you want to visualize, which is not clear to me.
output1.plot()
#histogram
output2.hist()
d) You can save your new dataframes using pandas as well. Here is an example of a CSV file
df.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None)
I hope I could shed some light into your doubts ;) .
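Putting steps a), b) and d) together, a minimal sketch on the sample data from the question could look like this. Reading from a string instead of a file, and the variable names `greater` and `lesser`, are assumptions made for illustration.

```python
import io

import pandas as pd

# Sample data from the question, read from a string instead of a file.
csv_text = """Names,Numbers
John,12
Kim,5
Alex,16
mike,2
giki,8
David,18
"""
df = pd.read_csv(io.StringIO(csv_text))

# Boolean masks select the rows above and below 10.
greater = df[df["Numbers"] > 10]
lesser = df[df["Numbers"] < 10]

print(greater.to_string(index=False))
print(lesser.to_string(index=False))

# Without a path, to_csv returns the CSV text; pass a file path to write to disk.
greater_csv = greater.to_csv(index=False)
lesser_csv = lesser.to_csv(index=False)
```

For the full file, pd.read_csv(path, usecols=[2, 6]) would load only the third and seventh columns (positions are zero-based). With matplotlib installed, greater.plot(x="Names", y="Numbers", kind="bar") is one way to draw a bar chart of a result. Note that a value of exactly 10 falls into neither group with these strict comparisons.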

I have a code here, I want to find the total number of females and males in a certain csv file.

import pandas as pd
df = pd.read_csv('admission_data.csv')
df.head()
female = 0
male = 0
for row in df:
    if df['gender'].any()=='female':
        female = female+1
    else:
        male = male+1
print(female)
print(male)
The CSV file has 5 columns.
I want to find the total number of females, males and number of them admitted, number of females admitted, males admitted
Thank you. This is the code I have tried and some more iterations of the above code but none of them seem to work.
Your if logic is wrong.
No need for a loop at all.
print(df['gender'].tolist().count('female'))
print(df['gender'].tolist().count('male'))
Alternatively you can use value_counts as @Wen suggested:
print(df['gender'].value_counts()['male'])
print(df['gender'].value_counts()['female'])
Rule of thumb: 99% of the time there is no need to use explicit loops when working with pandas. If you find yourself using one, there is most probably a better (and faster) way.
You just need value_counts
df['gender'].value_counts()
I created the below csv file:
student_id,gender,major,admitted
35377,female,chemistry,False
56105,male,physics,True
31441,female,chemistry,False
51765,male,physics,True
31442,female,chemistry,True
Reading the csv file into dataframe:
import pandas as pd
df=pd.read_csv('D:/path/test1.csv', sep=',')
result = df[df['admitted']==True].groupby(['gender','admitted']).size().reset_index(name='count')
result
   gender  admitted  count
0  female      True      1
1    male      True      2
Hope this helps!
I think you can use one of these:
# This line creates a DataFrame that contains only the rows where gender is male
count_male = df[df['gender'] == "male"]
# The second line counts how many values are in the gender column
count_male['gender'].size
(or)
# Build a boolean mask and sum it; each True counts as 1
count_male = df['gender'] == "male"
count_male.sum()
Take the values in the column gender, store in a list, and count the occurrences:
import pandas as pd
df = pd.read_csv('admission_data.csv')
print(list(df['gender']).count('female'))
print(list(df['gender']).count('male'))
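The question also asks for the number of admitted students, overall and per gender. A minimal sketch of one way to get all of these, assuming the column names `gender` and `admitted` and reusing the small sample file created in the answer above (read from a string here so the example runs on its own):

```python
import io

import pandas as pd

# Sample rows from the answer above, read from a string instead of a file.
csv_text = """student_id,gender,major,admitted
35377,female,chemistry,False
56105,male,physics,True
31441,female,chemistry,False
51765,male,physics,True
31442,female,chemistry,True
"""
df = pd.read_csv(io.StringIO(csv_text))

# Total number of rows per gender.
total_by_gender = df["gender"].value_counts()

# 'admitted' is boolean, so summing counts the True values.
total_admitted = int(df["admitted"].sum())

# Summing within each gender group gives the admitted count per gender.
admitted_by_gender = df.groupby("gender")["admitted"].sum()

print(total_by_gender)
print(total_admitted)
print(admitted_by_gender)
```

This relies on read_csv parsing the True/False text as booleans; if the real file stores admission differently (for example as "yes"/"no"), the column would need converting first.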

Dataframe Not Getting Updated Pandas

import pandas as pd
import numpy as np
titanic= pd.read_csv("C:\\Users\\Shailesh.Rana\\Downloads\\train.csv")
title=[] #to extract titles out of names
for i in range(len(titanic)):
    title.append(titanic.loc[:,"Name"].iloc[i].split(" ")[1]) #index 1 is title
titanic.iloc[(np.array(title)=="Master.")&(np.array(titanic.Age.isnull()))].loc[:,"Age"]=3.5
#values with master title and NAN as age
The last line doesn't make a change to the original dataset. In fact, if I run this line again, it still shows a series with 4 NaN values.
Use str.split with str[1] to select the second element of each split list.
Converting to a numpy array is not necessary, and the iloc should be removed too.
titanic = pd.DataFrame({'Name':['John Master.','Joe','Mary Master.'],
                        'Age':[10,20,np.nan]})
titanic.loc[(titanic.Name.str.split().str[1]=="Master.") &(titanic.Age.isnull()) ,"Age"]=3.5
print (titanic)
    Age          Name
0  10.0  John Master.
1  20.0           Joe
2   3.5  Mary Master.
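The original line most likely has no effect because titanic.iloc[mask] with a boolean mask returns a copy, so the .loc[...] = 3.5 that follows writes to that temporary copy rather than to titanic. The sketch below applies the single .loc form to names shaped like those in the question, where the title is the second space-separated token. The three rows are made up for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Made-up rows in a "Surname, Title. First name" style; not the real data.
titanic = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Palsson, Master. Gosta", "Rice, Master. Eugene"],
    "Age": [22.0, np.nan, 2.0],
})

# The second space-separated token is the title.
title = titanic["Name"].str.split(" ").str[1]

# Rows with the title "Master." and a missing age.
mask = (title == "Master.") & titanic["Age"].isnull()

# One .loc call selects rows and column together, so the assignment
# is applied to titanic itself rather than to an intermediate copy.
titanic.loc[mask, "Age"] = 3.5
print(titanic)
```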
