import pandas as pd
import numpy as np
titanic = pd.read_csv("C:\\Users\\Shailesh.Rana\\Downloads\\train.csv")
title = []  # to extract titles out of names
for i in range(len(titanic)):
    title.append(titanic.loc[:, "Name"].iloc[i].split(" ")[1])  # index 1 is the title
titanic.iloc[(np.array(title) == "Master.") & (np.array(titanic.Age.isnull()))].loc[:, "Age"] = 3.5
#values with master title and NAN as age
The last line doesn't make a change to the original dataset. In fact, if I run this line again, it still shows a series with 4 NaN values.
Use str.split with str[1] to select the second element of each split list.
Also, converting to a numpy array is not necessary, and the chained iloc should be removed too; do the selection and assignment in a single loc call:
titanic = pd.DataFrame({'Name': ['John Master.', 'Joe', 'Mary Master.'],
                        'Age': [10, 20, np.nan]})
titanic.loc[(titanic.Name.str.split().str[1] == "Master.") & (titanic.Age.isnull()), "Age"] = 3.5
print(titanic)
Age Name
0 10.0 John Master.
1 20.0 Joe
2 3.5 Mary Master.
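The reason the original line was a no-op is that iloc[...] returns a new object, so the follow-up .loc assignment writes to that temporary copy rather than to titanic. Applied to the original frame, the fix would look something like this (a sketch, assuming the same column names as in the question; the path is hypothetical):
import pandas as pd

# hypothetical path; substitute the real location of train.csv
titanic = pd.read_csv("train.csv")

# one .loc call does both selection and assignment, so it modifies
# the original frame instead of a temporary copy
mask = (titanic["Name"].str.split().str[1] == "Master.") & titanic["Age"].isnull()
titanic.loc[mask, "Age"] = 3.5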
I have 2 csv files as below. I have to add the Lat and Long features from one csv to another based on the ID
Based on the ID, I have to add the columns Latitude and Longitude to CSV 1 (ID 1 has one set of values, ID 2 has another, and so on).
Can anyone please tell me how to do this using python?
I recommend you use the pandas library. If you've not come across it before, it's a very widely-used library for data manipulation that makes this task very easy. However, it's not part of the standard library that comes packaged with Python, so depending on your environment you may need to install it from PyPI (for example with pip install pandas). Guides like this might help if you've not done that before.
With pandas installed in your environment, you can run the following code (substitute in your file paths for csv1.csv and csv2.csv).
import pandas as pd
# Load the csv files into dataframes and set the index of each to the 'ID' column
df1 = pd.read_csv('csv1.csv').set_index('ID')
df2 = pd.read_csv('csv2.csv').set_index('ID')
# Join using how='outer' to keep all rows from both dataframes
joined = df1.join(df2, how='outer')
print(joined)
# Save to a new csv file
joined.to_csv('joined.csv')
I made some simple sample data to demonstrate the result of running the code:
csv1.csv:
ID,my_feature
1,banana
2,apple
3,pear
csv2.csv:
ID,latitude,longitude
1,7,-4
4,10,15
Printed output:
my_feature latitude longitude
ID
1 banana 7.0 -4.0
2 apple NaN NaN
3 pear NaN NaN
4 NaN 10.0 15.0
Output to joined.csv:
ID,my_feature,latitude,longitude
1,banana,7.0,-4.0
2,apple,,
3,pear,,
4,,10.0,15.0
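If you'd rather keep ID as an ordinary column instead of an index, pd.merge gives the same result (a sketch using the same sample files):
import pandas as pd

df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('csv2.csv')

# how='outer' keeps rows whose ID appears in only one of the files
joined = pd.merge(df1, df2, on='ID', how='outer')
joined.to_csv('joined.csv', index=False)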
First time posting here - have decided to try and learn how to use python whilst on Covid-19 forced holidays.
I'm trying to summarise some data from a pretty simple database and have been using the value_counts function.
Rather than running it on every column individually, I'd like to loop it over each one and return a summary table. I can do this using df.apply(pd.value_counts), but I can't work out how to pass parameters to value_counts, as I want dropna=False.
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
# create list of winners and runnerup
data = [['john', 'barry'], ['john','barry'], [np.nan,'barry'], ['barry','john'],['john',np.nan],['linda','frank']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['winner', 'runnerup'])
# print dataframe.
df
How I was doing the value counts for each column:
#Who won the most?
df['winner'].value_counts(dropna=False)
Output:
john 3
linda 1
barry 1
NaN 1
Name: winner, dtype: int64
How can I enter the dropna=False when using apply function? I like the table it outputs below but want the NaN to appear in the list.
#value counts table
df.apply(pd.value_counts)
winner runnerup
barry 1.0 3.0
frank NaN 1.0
john 3.0 1.0
linda 1.0 NaN
#value that is missing from list
#NaN 1.0 1.0
Any help would be appreciated!!
You can use df.apply, like this:
df.apply(pd.value_counts, dropna=False)
With pandas' apply, if the function takes a single parameter, you simply do:
.apply(func_name)
The parameter passed in is the column (or row) being processed; keyword arguments such as dropna=False are forwarded to the function directly, as in the line above.
This works the same way for pandas built-in functions as well as user-defined functions (UDFs).
For a UDF that takes more than one positional parameter:
.apply(func_name, args=(arg1, arg2, arg3, ...))
See: this link
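For instance, a toy UDF with extra positional arguments could be applied like this (scale_and_shift is a hypothetical function, purely for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# each column arrives as a Series; factor and offset come from args
def scale_and_shift(col, factor, offset):
    return col * factor + offset

print(df.apply(scale_and_shift, args=(10, 1)))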
Given a Pandas dataframe such as:
Name Age
John 20
Mary 65
Bob 55
I wish to iterate over the rows, decide whether each person is a senior (age>=60) or not, create a new entry with an extra column, then append that to a csv file such that it (the csv file) reads as follows:
Name Age Senior
John 20 False
Mary 65 True
Bob 55 False
Other than saving the data to a csv, I am able to do the rest by turning the series the loop is currently iterating over into a dictionary, then adding a new key.
for idx, e in records.iterrows():
    entry = e.to_dict()
    entry["senior"] = (entry["Age"] >= 60)
Simply converting the dict to a series and then to a dataframe isn't writing it to the csv file properly. Is there a pandas or non-pandas way of making this work?
IMPORTANT EDIT: The above is a simplified example. I am dealing with hundreds of rows, and the data I want to add is a long string created at run time, so looping is mandatory. Also, adding it to the original dataframe isn't an option, as I am pretty sure I'll run out of memory at some point (so I can't add the data to the original dataframe, nor create a new dataframe with all the information). I don't want to add the data to the original dataframe, only to a copy of a "row" that will then be appended to a csv.
The example is given to provide some context for my question, but the main focus should be on the question, not the example.
Loops are not necessary here; just assign a new column by comparing with a scalar. To avoid creating the column in the original DataFrame, use DataFrame.assign, which returns a new DataFrame with the new column and leaves the original unchanged:
df1 = df.assign(senior=df["Age"] >= 60)
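A quick check on the question's sample data (a sketch):
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary', 'Bob'], 'Age': [20, 65, 55]})
df1 = df.assign(senior=df['Age'] >= 60)
print(df1)  # senior is False, True, False; df itself is unchanged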
EDIT:
If you really need loops (not recommended):
for idx, e in df.iterrows():
    df.loc[idx, "senior"] = e["Age"] >= 60
print(df)
Name Age senior
0 John 20 False
1 Mary 65 True
2 Bob 55 False
You can also use ge:
df2 = df.copy()
df2['senior'] = df2['Age'].ge(60)
And now:
print(df2)
Output:
Name Age senior
0 John 20 False
1 Mary 65 True
2 Bob 55 False
Use np.where:
import numpy as np
df1 = df.copy()
df1['Senior'] = np.where(df1['Age'] >= 60, True, False)
Found the answer I needed here: Convert a dictionary to a pandas dataframe
Code:
first_entry = True
for idx, e in records.iterrows():
    entry = e.to_dict()
    entry["senior"] = (entry["Age"] >= 60)
    df_entry = pd.DataFrame([entry], columns=entry.keys())
    # output_path is the path to the csv; header is a list of the new column names
    df_entry.to_csv(output_path, sep=',', index=False, columns=header, header=first_entry, mode='a')
    first_entry = False
Was hoping for a better way to do it, but this one works fine.
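If the rows come from a large file to begin with, a chunked read keeps memory bounded without a per-row loop (a sketch, assuming a hypothetical records.csv with an Age column and an output.csv target):
import pandas as pd

first_chunk = True
# chunksize makes read_csv yield DataFrames of at most 10000 rows each
for chunk in pd.read_csv("records.csv", chunksize=10000):
    chunk["senior"] = chunk["Age"] >= 60
    # append each processed chunk, writing the header only once
    chunk.to_csv("output.csv", mode="a", index=False, header=first_chunk)
    first_chunk = False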
I'm a complete beginner with Python, and I'm trying to find out how many Female and Male entries I have in my table.
I've imported my table and printed it, but I don't know how to count occurrences in my columns:
tableur = gc.open("KIDS91").sheet1
print(tableur)
CITY CSP BIRTHDAY NAME GENDER
LOND 48 01/04/2009 Peter M
LOND 20 06/11/2008 Lucy F
LOND 22 23/06/2009 Piper F
See pandas value_counts -
tableur.GENDER.value_counts()
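Note that tableur as opened above is a gspread worksheet rather than a pandas DataFrame, so you may need to convert it first (a sketch, assuming gspread's get_all_records method and a header row in the sheet):
import pandas as pd

# get_all_records() returns a list of dicts keyed by the header row
df = pd.DataFrame(tableur.get_all_records())
print(df.GENDER.value_counts())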
Perhaps this will help?
import pandas as pd
df = pd.read_csv("KIDS91", sep ='\t')
df.GENDER.value_counts()
To simplify a bit:
First line - imports pandas.
Second line - loads your data frame into memory.
Last line - returns the counted values from the GENDER column of the data frame named df.
import pandas as pd
df = pd.read_csv('admission_data.csv')
df.head()
female = 0
male = 0
for row in df:
    if df['gender'].any() == 'female':
        female = female + 1
    else:
        male = male + 1
print(female)
print(male)
The CSV file has 5 columns.
I want to find the total number of females and males, and how many of each were admitted.
Thank you. This is the code I have tried, along with some more iterations of it, but none of them seem to work.
Your if logic is wrong.
No need for a loop at all.
print(df['gender'].tolist().count('female'))
print(df['gender'].tolist().count('male'))
Alternatively you can use value_counts as #Wen suggested:
print(df['gender'].value_counts()['male'])
print(df['gender'].value_counts()['female'])
Rule of thumb: 99% of the time there is no need for explicit loops when working with pandas. If you find yourself using one, there is most probably a better (and faster) way.
You just need value_counts:
df['gender'].value_counts()
I created the below csv file:
student_id,gender,major,admitted
35377,female,chemistry,False
56105,male,physics,True
31441,female,chemistry,False
51765,male,physics,True
31442,female,chemistry,True
Reading the csv file into a dataframe and grouping:
import pandas as pd
df = pd.read_csv('D:/path/test1.csv', sep=',')
df = df[df['admitted'] == True].groupby(['gender', 'admitted']).size().reset_index(name='count')
print(df)
gender admitted count
0 female True 1
1 male True 2
Hope this helps!
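To get the totals and the admitted counts in one table, a pd.crosstab of gender against admitted would also work (a sketch on the same sample file):
import pandas as pd

df = pd.read_csv('D:/path/test1.csv', sep=',')

# rows are the genders, columns the admitted values; margins adds totals
print(pd.crosstab(df['gender'], df['admitted'], margins=True))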
I think you can use these:
# This creates a data frame that only has rows where Gender is male
count_male = df[df['Gender'] == "male"]
# Then count how many values are in its Gender column
count_male['Gender'].size
(or)
count_male = df['Gender'] == "male"
count_male.sum()
Take the values in the column gender, store in a list, and count the occurrences:
import pandas as pd
df = pd.read_csv('admission_data.csv')
print(list(df['gender']).count('female'))
print(list(df['gender']).count('male'))