How do I read a specific csv file into a pandas df? - python

I'm having a problem reading the file titanic.csv into a pandas DataFrame. The csv is delimited by ",", but when I try to read it into pandas with the following code:
df = pd.read_csv("titanic_train.csv")
df.head()
all the values end up in the first column. I tried adding delimiter="," to the read call, but still no luck.
Any ideas on where I'm going wrong?
Thanks a lot!

Like others mentioned, a simple read_csv should have worked for you.
Here are a few ways to debug:
You can run the self-contained code below and see if it works.
You can copy-paste the included string into a text file and try to load that.
You can use an online Python editor, e.g. Google Colab, to make sure it's not related to your local setup.
You can post a link to the csv to get further help.
import pandas as pd
from io import StringIO
sample=StringIO('''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
''')
df = pd.read_csv(sample)
print(df)
Output:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q

Related

python pandas read_csv puts the whole row in the first cell of each row

I am playing with the famous Titanic data. I have a comma-separated csv, and the data looks like this:
passengerId,survived,pclass,name,sex,age,sibSp,parch,ticket,fare,cabin,embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
I am trying to use pandas.read_csv but it doesn't work.
My code:
import pandas as pd
titanic = pd.read_csv('titanic.csv')
print(titanic.head(10))
I tried a couple of combinations of arguments to the read_csv method: sep=',', decimal=',', delimiter=',', and I still get the same output, which is:
passengerId survived ... cabin embarked
0 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/... NaN ... NaN NaN
1 2,1,1,"Cumings, Mrs. John Bradley (Florence Br... NaN ... NaN NaN
2 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S... NaN ... NaN NaN
I tried to search other Stack Overflow questions but I couldn't find an answer. Thank you for your help.
It seems the problem is that you have commas inside some of your columns.
The quotechar parameter might help, since it tells pandas to treat commas between the specified character (") as part of the field rather than as separators:
titanic = pd.read_csv('titanic.csv', quotechar='"', sep=",")
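If the file turns out not to be comma-separated at all (tabs or semicolons are common culprits for the everything-in-one-column symptom), pandas can sniff the delimiter itself when you pass sep=None with the Python engine. A minimal sketch, using a made-up semicolon-delimited string in place of the real file:

```python
import io

import pandas as pd

# Made-up stand-in for titanic.csv, deliberately semicolon-delimited
data = io.StringIO("passengerId;survived;pclass\n1;0;3\n2;1;1\n")

# sep=None with engine='python' uses csv.Sniffer to detect the delimiter
df = pd.read_csv(data, sep=None, engine='python')
```
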

Plotting boolean frequency against qualitative data in pandas

I'll start off by saying that I'm not really talented in statistical analysis. I have a dataset stored in a .csv file that I'm looking to represent graphically. What I'm trying to represent is the frequency of survival (represented for each person as a 0 or 1 in the Survived column) for each unique entry in the other columns.
For example: one of the other columns, Class, holds one of three possible values (1, 2, or 3). I want to graph the probability that someone from Class 1 survives versus Class 2 versus Class 3, so that I can visually determine whether or not class is correlated to survival rate.
I've attached the snippet of code that I've developed so far, but I'd understand if everything I'm doing is wrong because I've never used pandas before.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

print(list(df)[2:])  # skip the first 2 columns ("PassengerId" and "Survived")

for column in list(df)[2:]:
    try:
        df.plot(x='Survived', y=column, kind='hist')
    except TypeError:
        print("Column {} not usable.".format(column))

plt.show()
EDIT: I've attached a small segment of the dataframe below
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ... PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ... 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q
I think you want this:
df.groupby('Pclass')['Survived'].mean()
This separates the dataframe into three groups based on the three unique values of Pclass. It then takes the mean of Survived, which equals the number of 1 values divided by the total number of values, i.e. the survival rate. This produces a Series looking something like this:
Pclass
1 0.558824
2 0.636364
3 0.696970
It is then trivial from there to plot a bar graph with .plot.bar() if you wish.
Adding to the answer, here is a simple bar graph.
result = df.groupby('Pclass')['Survived'].mean()
result.plot(kind='bar', rot=1, ylim=(0, 1))
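For completeness, here is a self-contained sketch of the whole approach; the rows below are made up, but they have the same Pclass/Survived columns as train.csv:

```python
import pandas as pd

# Toy stand-in for train.csv (made-up rows, same columns of interest)
df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survival rate per class: the mean of a 0/1 column is the fraction of 1s
rates = df.groupby('Pclass')['Survived'].mean()

# rates.plot(kind='bar', rot=1, ylim=(0, 1)) would draw the bar chart
```
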

Find and replace in dataframe from another dataframe

I have two dataframes, here are snippets of both below. I am trying to find and replace the artists names in the second dataframe with the id's in the first dataframe. Is there a good way to do this?
id fullName
0 1 Colin McCahon
1 2 Robert Henry Dickerson
2 3 Arthur Dagley
Artists
0 Arthur Dagley, Colin McCahon, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 Robert Henry Dickerson
3 Steve Carr
Desired output:
Artists
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
You can do this with replace, passing a mapping built with zip:
df1.Artists.replace(dict(zip(df.fullName,df.id.astype(str))),regex=True)
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
Name: Artists, dtype: object
Convert your first dataframe into a dictionary (note .values, so pandas doesn't try to align the id Series' old index with the new one):
d = pd.Series(name_df.id.astype(str).values, index=name_df.fullName).to_dict()
Then use .replace():
artists_df["Artists"] = artists_df["Artists"].replace(d, regex=True)
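Putting either answer together end to end on the question's sample data (note that replace with regex=True treats each name as a regular-expression pattern, so names containing regex metacharacters would need re.escape first):

```python
import pandas as pd

name_df = pd.DataFrame({
    'id': [1, 2, 3],
    'fullName': ['Colin McCahon', 'Robert Henry Dickerson', 'Arthur Dagley'],
})
artists_df = pd.DataFrame({
    'Artists': ['Arthur Dagley, Colin McCahon, Maria Cruz',
                'Robert Henry Dickerson',
                'Steve Carr'],
})

# Map each full name to its id (as a string) and substitute within cells
d = dict(zip(name_df.fullName, name_df.id.astype(str)))
artists_df['Artists'] = artists_df['Artists'].replace(d, regex=True)
```
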

Pivot pandas dataframe with dates and showing counts per date

I have the following pandas DataFrame (currently ~500 rows):
merged_verified =
Last Verified Verified by
0 2016-07-11 John Doe
1 2016-07-11 John Doe
2 2016-07-12 John Doe
3 2016-07-11 Mary Smith
4 2016-07-12 Mary Smith
I am attempting to pivot_table() it to receive the following:
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Currently I'm running
merged_verified = merged_verified.pivot_table(index=['Verified by'], values=['Last Verified'], aggfunc='count')
which gives me close to what I need, but not exactly:
Last Verified
Verified by
John Doe 3
Mary Smith 2
I've tried a variety of things with the parameters, but none of it worked. The result above is the closest I've come to what I need. I read somewhere that I would need to add an additional column of dummy values (1's) that I could then sum, but that seems counter-intuitive for what I believe to be a simple DataFrame layout.
You can add the parameter columns and aggregate by len:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
columns=['Last Verified'],
values=['Last Verified'],
aggfunc=len)
print (merged_verified)
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Or you can also omit values:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
columns=['Last Verified'],
aggfunc=len)
print (merged_verified)
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Use groupby, value_counts, and unstack:
merged_verified.groupby('Last Verified')['Verified by'].value_counts().unstack(0)
Timing
Benchmarked on an example dataframe with 1 million rows, built as follows:
import string
import numpy as np
import pandas as pd

letters = list(string.ascii_uppercase)
idx = pd.MultiIndex.from_product(
    [
        pd.date_range('2016-03-01', periods=100),
        pd.DataFrame(np.random.choice(letters, (10000, 10))).sum(1)
    ], names=['Last Verified', 'Verified by'])
merged_verified = idx.to_series().reset_index()[idx.names]
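A runnable version of the groupby/value_counts approach on the question's five sample rows; pd.crosstab(df['Verified by'], df['Last Verified']) would give the same counts in one call:

```python
import pandas as pd

merged_verified = pd.DataFrame({
    'Last Verified': ['2016-07-11', '2016-07-11', '2016-07-12',
                      '2016-07-11', '2016-07-12'],
    'Verified by':   ['John Doe', 'John Doe', 'John Doe',
                      'Mary Smith', 'Mary Smith'],
})

# Count rows per (date, person), then move the dates into the columns
counts = (merged_verified.groupby('Last Verified')['Verified by']
          .value_counts()
          .unstack(0))
```
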

In Pandas, how to create a unique ID based on the combination of many columns?

I have a very large dataset, that looks like
df = pd.DataFrame({'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan], 'C': ['indiana jones', 'duck mc duck', 'batman','duck mc duck',np.nan]})
df
Out[173]:
B C
0 john smith indiana jones
1 john doe duck mc duck
2 adam smith batman
3 john doe duck mc duck
4 NaN NaN
I need to create an ID variable that is unique for every B-C combination. That is, the output should be:
B C ID
0 john smith indiana jones 1
1 john doe duck mc duck 2
2 adam smith batman 3
3 john doe duck mc duck 2
4 NaN NaN 0
I actually don't care whether the index starts at zero or not, or whether the value for the missing columns is 0 or any other number. I just want something fast that does not take a lot of memory and can be sorted quickly.
I use:
df['combined_id']=(df.B+df.C).rank(method='dense')
but the output is float64 and takes a lot of memory. Can we do better?
Thanks!
I think you can use factorize:
df['combined_id'] = pd.factorize(df.B+df.C)[0]
print(df)
B C combined_id
0 john smith indiana jones 0
1 john doe duck mc duck 1
2 adam smith batman 2
3 john doe duck mc duck 1
4 NaN NaN -1
Making jezrael's answer a little more general (what if the columns are not strings?), you can use this compact function:
def make_identifier(df):
    str_id = df.apply(lambda x: '_'.join(map(str, x)), axis=1)
    return pd.factorize(str_id)[0]

df['combined_id'] = make_identifier(df[['B','C']])
jezrael's answer is great. But since this is for multiple columns, I prefer to use .ngroup(), since that way NaN can remain NaN.
df['combined_id'] = df.groupby(['B', 'C'], sort = False).ngroup()
df
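A runnable comparison of the factorize and ngroup approaches on the question's data, showing how each one encodes the NaN row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan],
    'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan],
})

# factorize labels the missing combination -1
df['factorize_id'] = pd.factorize(df.B + df.C)[0]

# ngroup leaves rows with NaN keys as NaN (groupby drops them by default)
df['ngroup_id'] = df.groupby(['B', 'C'], sort=False).ngroup()
```
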
