Extraneous empty row after header - python

This is my code:
import pandas as pd
data = [('index', 'name', 'age'), ('idx01', 'John', 23), ('idx02', 'Marc', 32), ('idx03', 'Helena', 12)]
columns = data.pop(0)
df = pd.DataFrame(data, columns=columns).set_index(columns[0])
print(df)
Which produces:
         name  age
index              <----- Where is this row coming from?
idx01    John   23
idx02    Marc   32
idx03  Helena   12
I do not understand where the empty index row is coming from. Is it a header or a data row? Why is it being generated? It is just an empty row, but other dataframes in my code (generated with other methods) do not have it. I would like to make sure my data is not corrupted somehow.

It is not an empty row; it is the index name. When you call set_index with a named column, that column's name (index in this sample) becomes df.index.name, and pandas prints it on its own line above the index labels. Remove it with df.index.name = None and the extra line disappears.
import pandas as pd
data = [('index', 'name', 'age'),
        ('idx01', 'John', 23),
        ('idx02', 'Marc', 32),
        ('idx03', 'Helena', 12)]
columns = data.pop(0)
df = pd.DataFrame(data, columns=columns).set_index(columns[0])
print(df)
         name  age
index
idx01    John   23
idx02    Marc   32
idx03  Helena   12
df.index.name = None
print(df)
         name  age
idx01    John   23
idx02    Marc   32
idx03  Helena   12
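An alternative to mutating index.name in place (a minimal sketch, assuming a reasonably recent pandas version) is rename_axis, which returns a new frame:
# equivalent to setting df.index.name = None, but returns a new frame
df = df.rename_axis(index=None)
print(df)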

Related

How to pivot a table based on the values of one column

Let's say I have the below dataframe:
import pandas as pd
dataframe = pd.DataFrame({'col1': ['Name', 'Location', 'Phone', 'Name', 'Location'],
                          'Values': ['Mark', 'New York', '656', 'John', 'Boston']})
which looks like this:
       col1    Values
0      Name      Mark
1  Location  New York
2     Phone       656
3      Name      John
4  Location    Boston
As you can see, my desired column names appear as values in col1, and not every person has a Phone number. Is there a way to transform this dataframe to look like this:
Name  Location  Phone
Mark  New York  656
John  Boston    NaN
I have tried transposing in Excel, and doing a pivot and a pivot_table:
pivoted = pd.pivot_table(data = dataframe, values='Values', columns='col1')
But this comes out incorrectly. Any help would be appreciated.
NOTE: Each new section starts with the Name value and ends before the Name value of the next person.
Create a new index using cumsum to identify unique sections, then pivot as usual:
df['index'] = df['col1'].eq('Name').cumsum()
df.pivot(index='index', columns='col1', values='Values')
col1   Location  Name Phone
index
1      New York  Mark   656
2        Boston  John   NaN
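If you want the exact layout from the question (Name, Location, Phone order, without the axis names), a small follow-up sketch along the same lines, assuming a reasonably recent pandas version:
out = (df.pivot(index='index', columns='col1', values='Values')
         [['Name', 'Location', 'Phone']]            # reorder the columns
         .rename_axis(index=None, columns=None)     # drop the 'index'/'col1' axis names
         .reset_index(drop=True))
print(out)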

Turning repeated row labels into column headers in pandas

I have a questionnaire in this format
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income', 'Name', 'Age', 'Income'],
                   'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
As you can see, the same 'Question' entries appear repeatedly, and I need to reformat this so that the result is as follows:
df2 = pd.DataFrame({'Name': ['Bob', 'Michelle'],
                    'Age': [50, 42],
                    'Income': [42000, 62000]})
Use numpy.reshape:
print (pd.DataFrame(df["Answer"].to_numpy().reshape((2,-1)), columns=df["Question"][:3]))
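The hard-coded shape works for this example; a slightly more general sketch (assuming every person answers the same questions in the same order; n_q and wide are names introduced here purely for illustration):
n_q = df["Question"].nunique()  # number of questions per person (3 here)
wide = pd.DataFrame(df["Answer"].to_numpy().reshape(-1, n_q),
                    columns=df["Question"][:n_q])
print(wide)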
Or transpose and pd.concat:
s = df.set_index("Question").T
print (pd.concat([s.iloc[:, n:n+3] for n in range(0, len(s.columns), 3)]).reset_index(drop=True))
Both yield the same result:
Question      Name Age Income
0              Bob  50  42000
1         Michelle  42  62000
You can create a new group column with .assign, using .groupby and .cumcount (Bob falls in the first group and Michelle in the second, with the groups determined by the repetition of Name, Age, and Income).
Then .pivot the dataframe with the index being the group.
code:
df3 = (df.assign(group=df.groupby('Question').cumcount())
         .pivot(index='group', values='Answer', columns='Question')
         .reset_index(drop=True)[['Name', 'Age', 'Income']])  # the final [['Name', 'Age', 'Income']] reorders the columns
df3
Out[76]:
Question      Name Age Income
0              Bob  50  42000
1         Michelle  42  62000
Here is a solution! It assumes that each observation spans the same, fixed number of rows (three rows each for Bob and Michelle):
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income', 'Name', 'Age', 'Income'],
                   'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
df = df.set_index("Question")
pd.concat([df.iloc[i:i+3, :].transpose() for i in range(0, len(df), 3)],
          axis=0).reset_index(drop=True)
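A hedged generalization of the same idea, computing the block size from the number of distinct questions instead of hard-coding 3 (n_q and result are illustrative names, not from the answer):
n_q = df.index.nunique()  # df is indexed by 'Question' at this point
result = pd.concat([df.iloc[i:i + n_q].T for i in range(0, len(df), n_q)],
                   axis=0).reset_index(drop=True)
print(result)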

How to replace column values based on a list?

I have a list like this:
x = ['Las Vegas', 'San Francisco', 'Dallas']
And a dataframe that looks a bit like this:
import pandas as pd
data = [['Las Vegas (Clark County)', 25], ['New York', 23],
        ['Dallas', 27]]
df = pd.DataFrame(data, columns=['City', 'Value'])
I want to replace city values in the DF such as "Las Vegas (Clark County)" with "Las Vegas". My dataframe contains multiple cities with differing names that need to be changed. I know I could use a regex to strip off the part in parentheses, but I was wondering if there is a more clever, generic way.
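For reference, the regex approach the question alludes to might look roughly like this (a minimal sketch that simply drops a trailing parenthesised part; it is not the generic, list-driven solution asked for):
# strip a trailing parenthesised suffix such as ' (Clark County)'
df['City'] = df['City'].str.replace(r'\s*\(.*\)$', '', regex=True)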
Use Series.str.extract with the list values joined by | (regex OR), then replace the non-matched values with the originals using Series.fillna:
df['City'] = df['City'].str.extract(f'({"|".join(x)})', expand=False).fillna(df['City'])
print (df)
        City  Value
0  Las Vegas     25
1   New York     23
2     Dallas     27
Another idea is to use Series.str.contains in a loop, but it may be slow for a large DataFrame with many values in the list:
for val in x:
    df.loc[df['City'].str.contains(val), 'City'] = val
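One caveat worth noting (an assumption about real-world data, not something raised in the question): str.contains treats its pattern as a regular expression by default, so list entries containing characters like ( or . should be matched literally or escaped first:
# match the list values literally rather than as regex patterns
for val in x:
    df.loc[df['City'].str.contains(val, regex=False), 'City'] = val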

accessing a pandas DataFrame cell

I'm having a pandas issue.
I have a dataframe that looks like the following:
       A      B      C      D
0    max   kate  marie   John
1   kate  marie    max   john
2   john    max   kate  marie
3  marie   john   kate    max
And I need to access, for instance, the cell in row 0, column D.
I tried using df.iloc[0, 3] but it returns the whole D column.
Any help would be appreciated.
You could use
df.iloc[0]['D']
or
df.loc[0,'D']
Documentation reference DataFrame.iloc
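For completeness (a small aside, not part of the original answer): when you only need a single scalar, .at and .iat also work and are typically faster than .loc / .iloc for that purpose:
df.at[0, 'D']   # label-based scalar access
df.iat[0, 3]    # position-based scalar access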
To get the value at a location.
df.iloc[0]["D"]
seems to do the trick
Assuming your DataFrame is named df, this works fine:
df.iloc[0,3]
Out[15]: 'John'
You can refer to this for your solution
# Import pandas package
import pandas as pd
# Define a dictionary containing employee data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)
print(df)
Then you get a table like this:
     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   24     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
If you want the name in the 0-th row (the "Name" column), the syntax is dataframe.iloc[row_index]['ColumnName']:
print(df.iloc[0]['Name'])

How to transform pandas JSON column into dataframe?

I have a .csv file with a mix of columns, some of which contain entries in (nested) JSON syntax. I want to extract the relevant data from these columns to obtain a more data-rich dataframe for further analysis. I've checked this tutorial on Kaggle but failed to obtain the desired result.
In order to better explain my problem, I've prepared a dummy version of the data below.
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
df = pd.DataFrame(raw)
I'd like to generate the following columns (or equivalent):
team    name_1  name_2  age_1  age_2
Team_1  Andy    Rick    22     30
Team_2  Oli     Joe     19     21
I've tried the following.
Code 1:
test_norm = json_normalize(data=df)
AttributeError: 'str' object has no attribute 'values'
Code 2:
test_norm = json_normalize(data=df, record_path='who')
TypeError: string indices must be integers
Code 3:
test_norm = json_normalize(data=df, record_path='who', meta=[team])
TypeError: string indices must be integers
Is there any way to do this effectively? I've looked for a solution in other Stack Overflow topics and cannot find one that works with json_normalize.
I also had trouble using json_normalize on the lists of dicts that were contained in the who column. My workaround was to reformat each row into a Dict with unique keys (name_1, age_1, name_2, etc.) for each team member's name/age. After this, creating a dataframe with your desired structure was trivial.
Here are my steps. Beginning with your example:
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
df = pd.DataFrame(raw)
df
     team                                                who
0  Team_1  [{'name': 'Andy', 'age': 22}, {'name': 'Rick',...
1  Team_2  [{'name': 'Oli', 'age': 19}, {'name': 'Joe', '...
Write a method to reformat a list as a Dict and apply to each row in the who column:
def reformat(x):
    res = {}
    for i, item in enumerate(x):
        res['name_' + str(i+1)] = item['name']
        res['age_' + str(i+1)] = item['age']
    return res
df['who'] = df['who'].apply(lambda x: reformat(x))
df
     team                                                who
0  Team_1  {'name_1': 'Andy', 'age_1': 22, 'name_2': 'Ric...
1  Team_2  {'name_1': 'Oli', 'age_1': 19, 'name_2': 'Joe'...
Use json_normalize on the who column. Then ensure the columns of the normalized dataframe appear in the desired order:
import pandas as pd
from pandas.io.json import json_normalize
n = json_normalize(data = df['who'], meta=['team'])
n = n.reindex(sorted(n.columns, reverse=True, key=len), axis=1)
n
  name_1 name_2  age_1  age_2
0   Andy   Rick     22     30
1    Oli    Joe     19     21
Join the dataframe created by json_normalize back to the original df, and drop the who column:
df = df.join(n).drop('who', axis=1)
df
     team name_1 name_2  age_1  age_2
0  Team_1   Andy   Rick     22     30
1  Team_2    Oli    Joe     19     21
If your real .csv file has a lot of rows, my solution may be a bit too expensive, since it iterates over each row and then over each entry in the list contained in that row. If (hopefully) this isn't the case, perhaps my approach will be good enough.
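A small version note (an aside, not part of the original answer): since pandas 1.0, json_normalize is available at the top level and the pandas.io.json import path is deprecated, so on recent versions the normalization step can be written as:
# pandas >= 1.0: no pandas.io.json import needed
n = pd.json_normalize(df['who'].tolist())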
One option would be to unpack the dictionary yourself. Like so:
from pandas.io.json import json_normalize
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
# add the corresponding team to the dictionary containing the person information
for idx, list_of_people in enumerate(raw['who']):
for person in list_of_people:
person['team'] = raw['team'][idx]
# flatten the dictionary
list_of_dicts = [dct for list_of_people in raw['who'] for dct in list_of_people]
# normalize to dataframe
json_normalize(list_of_dicts)
# due to unpacking of dict, this results in the same as doing
pd.DataFrame(list_of_dicts)
This output is a little different (one row per person, in long format), which is often more convenient for further analysis.
Output:
   age  name    team
0   22  Andy  Team_1
1   30  Rick  Team_1
2   19   Oli  Team_2
3   21   Joe  Team_2
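If you do need the one-row-per-team layout from the question, a rough follow-up sketch under the assumption of a reasonably recent pandas version (long_df and the member counter are names introduced here, not from the answer):
long_df = pd.DataFrame(list_of_dicts)
wide = (long_df.assign(member=long_df.groupby('team').cumcount() + 1)
               .pivot(index='team', columns='member', values=['name', 'age']))
wide.columns = [f'{col}_{i}' for col, i in wide.columns]  # e.g. name_1, age_2
print(wide.reset_index())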
You can iterate through each element in raw['who'] separately, but if you do, the resulting dataframe will have the two team members in separate rows.
Example:
json_normalize(raw['who'][0])
Output:
   age  name
0   22  Andy
1   30  Rick
You can flatten these out into a single row and then concatenate all the rows to get your final output.
def flatten(df_temp):
    df_temp.index = df_temp.index.astype(str)
    flattened_df = df_temp.unstack().to_frame().sort_index(level=1).T
    flattened_df.columns = flattened_df.columns.map('_'.join)
    return flattened_df
df = pd.concat([flatten(pd.DataFrame(json_normalize(x))) for x in raw['who']])
df['team'] = raw['team']
Output:
age_0  name_0  age_1  name_1  team
22     Andy    30     Rick    Team_1
19     Oli     21     Joe     Team_2
