Having trouble with pandas

import pandas as pd
stack = pd.DataFrame(['adam',25,28,'steve',25,28,'emily',18,21])
print(stack[0].to_list()[0::2])
print(stack[0].to_list()[1::2])
df = pd.DataFrame(
{'Name': stack[0].to_list()[0::3],
'Age': stack[0].to_list()[1::3],
'New Age': stack[0].to_list()[2::3] }
)
print(df)
How do I separate adam and steve into different rows?
I want the output to line up like a table (the table image from the original question is no longer available).

You can convert the column to a list and use the slices [0::2] and [1::2]:
import pandas as pd
data = pd.DataFrame(['adam',22,'steve',25,'emily',18])
print(data)
#print(data[0].to_list()[0::2])
#print(data[0].to_list()[1::2])
df = pd.DataFrame({
'Name': data[0].to_list()[0::2],
'Age': data[0].to_list()[1::2],
})
print(df)
Before (as in the original image, which was removed from the question):
0
0 adam
1 22
2 steve
3 25
4 emily
5 18
After:
Name Age
0 adam 22
1 steve 25
2 emily 18
EDIT: The same works with a plain list:
import pandas as pd
data = ['adam',22,'steve',25,'emily',18]
print(data)
df = pd.DataFrame({
'Name': data[0::2],
'Age': data[1::2],
})
print(df)
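The same slicing idea extends to the three-field data from the question. A minimal sketch, assuming the flat list always repeats name, age, new age in that order; the names flat, columns and width are illustrative and not from the original posts:

```python
import pandas as pd

# Flat list from the question: name, age, new age repeated for each person.
flat = ['adam', 25, 28, 'steve', 25, 28, 'emily', 18, 21]
columns = ['Name', 'Age', 'New Age']  # assumed field order
width = len(columns)

# Cut the flat list into consecutive chunks of `width` items, one chunk per row.
rows = [flat[i:i + width] for i in range(0, len(flat), width)]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

Chunking by width avoids writing one slice per column, so adding a fourth field only means extending columns.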

These two lines should do it. However, without knowing what code you have, what you're trying to accomplish, or what else you intend to do with it, the following code is only valid in this situation.
d = {'Name': ['adam', 'steve', 'emily'], 'Age': [22, 25, 18]}
df = pd.DataFrame(d)

Related

In python, how to generate random sampling without replacement of a specific column?

Input data looks somewhat like this
import pandas as pd
df = pd.DataFrame({'users': ['John', 'Bob', 'Alice', 'John', 'Alice','Bob','Alice'],
'class': ['Economics','Economics','Economics','Maths','Maths','Physics','Physics']})
The random data should be generated such that class will not be replaced but users can be replaced.
random_df1 = pd.DataFrame({'users': ['John', 'Bob', 'Alice'],
'class': ['Economics','Maths','Physics']})
or
random_df2 = pd.DataFrame({'users': ['John', 'John', 'Bob'],
'class': ['Economics','Maths','Physics']})

Use Series.unique to get the unique values in the class column, then create a new dataframe with users (within a given class) randomly selected using np.random.choice:
import numpy as np
df_ = pd.DataFrame([
    {'users': np.random.choice(df.loc[df['class'].eq(c), 'users']), 'class': c}
    for c in df['class'].unique()])
Result:
print(df_)
users class
0 John Economics
1 Alice Maths
2 Alice Physics

Use groupby on the class column, then use the sample method to randomly select one row from each class:
df = df.groupby("class").apply(lambda x: x.sample(1)).reset_index(drop=True)
Output:
users class
0 Bob Economics
1 Alice Maths
2 Bob Physics
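On pandas 1.1 or newer, the grouped object has its own sample method, which avoids the apply call. This is a sketch of that alternative rather than the answerer's code; random_state=0 is an assumed seed, used only to make the draw repeatable:

```python
import pandas as pd

df = pd.DataFrame({
    'users': ['John', 'Bob', 'Alice', 'John', 'Alice', 'Bob', 'Alice'],
    'class': ['Economics', 'Economics', 'Economics', 'Maths', 'Maths', 'Physics', 'Physics'],
})

# Draw one row per class; each class appears exactly once,
# while the same user may be drawn for several classes.
sampled = df.groupby('class').sample(n=1, random_state=0).reset_index(drop=True)
print(sampled)
```

Drop random_state to get a different sample on each run.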

Turning repeated row labels into column headers in pandas

I have a questionnaire in this format
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
As you can see the same 'Question' appears repeatedly, and I need to reformat this so that the result is as follows
df2 = pd.DataFrame({'Name': ['Bob', 'Michelle'],
'Age': [ 50, 42],
'Income': [42000,62000]})

Use numpy.reshape:
print (pd.DataFrame(df["Answer"].to_numpy().reshape((2,-1)), columns=df["Question"][:3]))
Or transpose and pd.concat:
s = df.set_index("Question").T
print (pd.concat([s.iloc[:, n:n+3] for n in range(0, len(s.columns), 3)]).reset_index(drop=True))
Both yield the same result:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000

You can create a new column, group, with .assign, using .groupby and .cumcount (Bob falls in the first group and Michelle in the second, with the groups determined by the repetition of Name, Age, and Income).
Then .pivot the dataframe with group as the index.
Code:
df3 = (df.assign(group=df.groupby('Question').cumcount())
.pivot(index='group', values='Answer', columns='Question')
.reset_index(drop=True)[['Name','Age','Income']]) #[['Name','Age','Income']] at the end reorders the columns.
df3
Out[76]:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000
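To see why the cumcount step works, it can help to look at the intermediate group column before pivoting. A small sketch on the same sample data; the variable names group and wide are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Question': ['Name', 'Age', 'Income', 'Name', 'Age', 'Income'],
    'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000],
})

# cumcount numbers each occurrence of a question: the first 'Name' gets 0,
# the second 'Name' gets 1, and likewise for 'Age' and 'Income'.
group = df.groupby('Question').cumcount()
print(group.tolist())  # [0, 0, 0, 1, 1, 1]

# Rows that share a group number belong to the same respondent.
wide = df.assign(group=group).pivot(index='group', columns='Question', values='Answer')
print(wide)
```

Because the grouping relies on occurrence order rather than fixed positions, it does not depend on every respondent's answers being stored in a block of exactly three rows.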

Here is a solution! It assumes that every observation has the same number of questions (3 rows each for Bob and Michelle):
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
df=df.set_index("Question")
pd.concat([df.iloc[i:i+3,:].transpose() for i in range(0,len(df),3)],axis=0).reset_index(drop=True)

Pandas - create data frame from nested key values and nested list in the dictionary

How do I untangle a nested dictionary with lists as values into columns? I tried different combinations to convert the nested dictionary into a pandas data frame. After looking through Stack Overflow I am getting close to fixing the problem, but not quite there.
Sample Data:
test = {
'abc': {
'company_id': '123c',
'names': ['Oscar', 'John Smith', 'Smith, John'],
'education': ['MS', 'BS']
},
'DEF': {
'company_id': '124b',
'names': ['Matt B.'],
'education': ['']
}
}
Tried:
1)
pd.DataFrame(list(test.items())) # not working entirely - creates {dictionary in col '1'}
2)
df = pd.concat({
k: pd.DataFrame.from_dict(v, 'index') for k, v in test.items()
},
axis=0)
df2 = df.T
df2.reset_index() # creates multiple columns
Output Needed:

Update:
With the release of pandas 0.25 and the addition of explode this is now a lot easier:
frame = pd.DataFrame(test).T
frame = (
    frame.explode('names')
    .set_index(['company_id', 'names'], append=True)
    .explode('education')
    .reset_index(['company_id', 'names'])
)
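A self-contained variant of the explode approach, sketched under the assumption of pandas 1.1 or newer (for the ignore_index parameter). Renumbering the rows after each explode keeps the index unique, so the intermediate set_index step is not needed. The column name set_name is borrowed from the pre-0.25 result shown further down:

```python
import pandas as pd

test = {
    'abc': {
        'company_id': '123c',
        'names': ['Oscar', 'John Smith', 'Smith, John'],
        'education': ['MS', 'BS'],
    },
    'DEF': {
        'company_id': '124b',
        'names': ['Matt B.'],
        'education': [''],
    },
}

# One row per outer key, with the key kept as an ordinary column.
frame = pd.DataFrame(test).T.rename_axis('set_name').reset_index()

# Explode each list column in turn; ignore_index renumbers the rows each time.
frame = (
    frame.explode('names', ignore_index=True)
    .explode('education', ignore_index=True)
)
print(frame)
```

This yields one row per combination of name and education within each outer key.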
Pre pandas 0.25:
This is not really lean but then this is a rather complicated transformation.
Inspired by this blog post, I solved it using two separate iterations of turning the list column into a series and then transforming the DataFrame using melt.
import pandas as pd
test = {
'abc': {
'company_id': '123c',
'names': ['Oscar', 'John Smith', 'Smith, John'],
'education': ['MS', 'BS']
},
'DEF': {
'company_id': '124b',
'names': ['Matt B.'],
'education': ['']
}
}
frame = pd.DataFrame(test).T
names = frame.names.apply(pd.Series)
frame = frame.merge(
names, left_index=True, right_index=True).drop('names', axis=1)
frame = frame.reset_index().melt(
id_vars=['index', 'company_id', 'education'],
value_name='names').drop('variable', axis=1).dropna()
education = frame.education.apply(pd.Series)
frame = frame.merge(
education, left_index=True, right_index=True).drop('education', axis=1)
frame = frame.melt(
id_vars=['index', 'company_id', 'names'],
value_name='education').drop(
'variable', axis=1).dropna().sort_values(by=['company_id', 'names'])
frame.columns = ['set_name', 'company_id', 'names', 'education']
print(frame)
Result:
set_name company_id names education
2 abc 123c John Smith MS
6 abc 123c John Smith BS
0 abc 123c Oscar MS
4 abc 123c Oscar BS
3 abc 123c Smith, John MS
7 abc 123c Smith, John BS
1 DEF 124b Matt B.

accessing a pandas DataFrame cell

I'm having a pandas issue.
I have a dataframe that looks like the following:
A B C D
0 max kate marie John
1 kate marie max john
2 john max kate marie
3 marie john kate max
And I need to access, for instance, the cell in row 0, column D.
I tried using df.iloc[0, 3] but it returns the whole D column.
Any help would be appreciated.

You could use
df.iloc[0]['D']
or
df.loc[0,'D']
See the DataFrame.iloc documentation for details on getting the value at a location.

df.iloc[0]["D"]
seems to do the trick

It works fine if your Dataframe name is df.
df.iloc[0,3]
Out[15]: 'John'
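For a single cell, pandas also provides the scalar accessors .at (label based) and .iat (position based). A short sketch that rebuilds the frame from the question; the variable names by_label and by_position are illustrative:

```python
import pandas as pd

# The frame from the question, typed out by hand.
df = pd.DataFrame({
    'A': ['max', 'kate', 'john', 'marie'],
    'B': ['kate', 'marie', 'max', 'john'],
    'C': ['marie', 'max', 'kate', 'kate'],
    'D': ['John', 'john', 'marie', 'max'],
})

by_label = df.at[0, 'D']     # row label 0, column label 'D'
by_position = df.iat[0, 3]   # first row, fourth column
print(by_label, by_position)
```

Both return the scalar 'John', the same value that df.loc[0, 'D'] and df.iloc[0, 3] give on this frame.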

You can refer to this for your solution
# Import pandas package
import pandas as pd
# Define a dictionary containing employee data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print(df)
This prints the table. If you want the name in the 0-th row of the "Name" column, the syntax is dataframe.iloc[row_index]['ColumnName']:
print(df.iloc[0]['Name'])

How to transform pandas JSON column into dataframe?

I have a .csv file with mix of columns where some contain entries in JSON syntax (nested). I want to extract relevant data from these columns to obtain a more data-rich dataframe for further analysis. I've checked this tutorial on Kaggle but I failed to obtain the desired result.
In order to better explain my problem I've prepared a dummy version of a database below.
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
df = pd.DataFrame(raw)
I'd like to generate the following columns (or equivalent):
team name_1 name_2 age_1 age_2
Team_1 Andy Rick 22 30
Team_2 Oli Joe 19 21
I've tried the following.
Code 1:
test_norm = json_normalize(data=df)
AttributeError: 'str' object has no attribute 'values'
Code 2:
test_norm = json_normalize(data=df, record_path='who')
TypeError: string indices must be integers
Code 3:
test_norm = json_normalize(data=df, record_path='who', meta=[team])
TypeError: string indices must be integers
Is there any way to do this effectively? I've looked through other Stack Overflow topics and cannot find a working solution with json_normalize.

I also had trouble using json_normalize on the lists of dicts that were contained in the who column. My workaround was to reformat each row into a Dict with unique keys (name_1, age_1, name_2, etc.) for each team member's name/age. After this, creating a dataframe with your desired structure was trivial.
Here are my steps. Beginning with your example:
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
df = pd.DataFrame(raw)
df
team who
0 Team_1 [{'name': 'Andy', 'age': 22}, {'name': 'Rick',...
1 Team_2 [{'name': 'Oli', 'age': 19}, {'name': 'Joe', '...
Write a method to reformat a list as a Dict and apply to each row in the who column:
def reformat(x):
    res = {}
    for i, item in enumerate(x):
        res['name_' + str(i+1)] = item['name']
        res['age_' + str(i+1)] = item['age']
    return res
df['who'] = df['who'].apply(lambda x: reformat(x))
df
team who
0 Team_1 {'name_1': 'Andy', 'age_1': 22, 'name_2': 'Ric...
1 Team_2 {'name_1': 'Oli', 'age_1': 19, 'name_2': 'Joe'...
Use json_normalize on the who column. Then ensure the columns of the normalized dataframe appear in the desired order:
import pandas as pd
# pandas.io.json.json_normalize is deprecated since pandas 1.0; use pd.json_normalize
n = pd.json_normalize(df['who'].tolist())
n = n.reindex(sorted(n.columns, reverse=True, key=len), axis=1)
n
name_1 name_2 age_1 age_2
0 Andy Rick 22 30
1 Oli Joe 19 21
Join the dataframe created by json_normalize back to the original df, and drop the who column:
df = df.join(n).drop('who', axis=1)
df
team name_1 name_2 age_1 age_2
0 Team_1 Andy Rick 22 30
1 Team_2 Oli Joe 19 21
If your real .csv file has too many rows, my solution may be a bit too expensive (seeing as how it iterates over each row, and then over each entry inside the list contained in each row). If (hopefully) this isn't the case, perhaps my approach will be good enough.

One option would be to unpack the dictionary yourself. Like so:
import pandas as pd
raw = {"team":["Team_1","Team_2"],
"who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}
# add the corresponding team to the dictionary containing the person information
for idx, list_of_people in enumerate(raw['who']):
    for person in list_of_people:
        person['team'] = raw['team'][idx]
# flatten the nested lists into a single list of dictionaries
list_of_dicts = [dct for list_of_people in raw['who'] for dct in list_of_people]
# normalize to dataframe (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize)
pd.json_normalize(list_of_dicts)
# because the dictionaries are already flat, this gives the same result as
pd.DataFrame(list_of_dicts)
The output is shaped a little differently from what you asked for, but this long format is often more convenient for further analysis.
Output:
age name team
22 Andy Team_1
30 Rick Team_1
19 Oli Team_2
21 Joe Team_2
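The errors in the question most likely come from passing a DataFrame to json_normalize, which expects a dict or a list of dicts. A sketch that converts the frame to records first and then reshapes the long result into the wide layout from the question; the column name member and the variables records, long_df and wide are illustrative:

```python
import pandas as pd

raw = {
    "team": ["Team_1", "Team_2"],
    "who": [
        [{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
        [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}],
    ],
}
df = pd.DataFrame(raw)

# Turn each row into a dict so json_normalize can walk the nested 'who' lists.
records = df.to_dict('records')
long_df = pd.json_normalize(records, record_path='who', meta=['team'])

# Number the members within each team (1, 2, ...), then spread them into columns.
long_df['member'] = long_df.groupby('team').cumcount() + 1
wide = long_df.pivot(index='team', columns='member', values=['name', 'age'])

# Flatten the two-level column index into names like 'name_1' and 'age_2'.
wide.columns = [f'{field}_{member}' for field, member in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because name and age are pivoted together, the resulting columns may come out with object dtype; convert the age columns with astype(int) if numeric operations are needed.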

You can iterate through each element in raw['who'] separately, but when you do this the resulting dataframe will have the two team members in separate rows.
Example:
json_normalize(raw['who'][0])
Output:
age name
22 Andy
30 Rick
You can flatten these out into a single row and then concatenate all the rows to get your final output.
def flatten(df_temp):
    df_temp.index = df_temp.index.astype(str)
    flattened_df = df_temp.unstack().to_frame().sort_index(level=1).T
    flattened_df.columns = flattened_df.columns.map('_'.join)
    return flattened_df
df = pd.concat([flatten(pd.DataFrame(json_normalize(x))) for x in raw['who']])
df['team'] = raw['team']
Output:
age_0 name_0 age_1 name_1 team
22 Andy 30 Rick Team_1
19 Oli 21 Joe Team_2
