I have a questionnaire in this format
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
As you can see, the same 'Question' values repeat, and I need to reshape this so that the result is as follows:
df2 = pd.DataFrame({'Name': ['Bob', 'Michelle'],
'Age': [ 50, 42],
'Income': [42000,62000]})
Use numpy.reshape:
print (pd.DataFrame(df["Answer"].to_numpy().reshape((2,-1)), columns=df["Question"][:3]))
Or transpose and pd.concat:
s = df.set_index("Question").T
print (pd.concat([s.iloc[:, n:n+3] for n in range(0, len(s.columns), 3)]).reset_index(drop=True))
Both yield the same result:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000
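If the number of questions per person might change, you can infer the group size instead of hard-coding 2 and 3 — a sketch assuming the questions always repeat in the same order:

```python
import pandas as pd

df = pd.DataFrame({'Question': ['Name', 'Age', 'Income', 'Name', 'Age', 'Income'],
                   'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})

# infer how many questions make up one record
n = df['Question'].nunique()  # 3 here

out = pd.DataFrame(df['Answer'].to_numpy().reshape(-1, n),
                   columns=df['Question'][:n])
print(out)
```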
You can create a new group column with .assign, using .groupby and .cumcount (Bob falls in the first group and Michelle in the second, with the groups determined by the repetition of Name, Age, and Income).
Then .pivot the dataframe with the group as the index.
code:
df3 = (df.assign(group=df.groupby('Question').cumcount())
         .pivot(index='group', values='Answer', columns='Question')
         .reset_index(drop=True)[['Name', 'Age', 'Income']])  # the final [['Name', 'Age', 'Income']] reorders the columns
df3
Out[76]:
Question Name Age Income
0 Bob 50 42000
1 Michelle 42 62000
Here is a solution! It assumes each observation spans the same number of rows (three rows each for Bob and Michelle):
import pandas as pd
df = pd.DataFrame({'Question': ['Name', 'Age', 'Income','Name', 'Age', 'Income'],
'Answer': ['Bob', 50, 42000, 'Michelle', 42, 62000]})
df=df.set_index("Question")
pd.concat([df.iloc[i:i+3,:].transpose() for i in range(0,len(df),3)],axis=0).reset_index(drop=True)
For example, let's say that I have the dataframe:
import numpy as np
import pandas as pd

NAME = ['BOB', 'BOB', 'BOB', 'SUE', 'SUE', 'MARY', 'JOHN', 'JOHN', 'MARK', 'MARK', 'MARK', 'MARK']
STATE = ['CA','CA','CA','DC','DC','PA','GA','GA','NY','NY','NY','NY']
MAJOR = ['MARKETING','BUSINESS ADM',np.nan,'ECONOMICS','MATH','PSYCHOLOGY','HISTORY','BUSINESS ADM','MATH', 'MEDICAL SCIENCES',np.nan,np.nan]
SCHOOL = ['UCLA','UCSB','CAL STATE','HARVARD','WISCONSIN','YALE','CHICAGO','MIT','UCSD','UCLA','CAL STATE','COMMUNITY']
data = {'NAME':NAME, 'STATE':STATE,'MAJOR':MAJOR, 'SCHOOL':SCHOOL}
df = pd.DataFrame(data)
I am trying to concatenate rows that have multiple unique values for the same name.
I tried:
gr_columns = [x for x in df.columns if x not in ['MAJOR', 'SCHOOL']]
df1 = df.groupby(gr_columns).agg(lambda col: '|'.join(col))
and expected the rows to be concatenated where the NAME field is the same. Conveniently, the STATE field is static for each NAME. I would like the output to look like:
NAME  STATE  MAJOR                  SCHOOL
BOB   CA     MARKETING,BUSINESS ADM UCLA,UCSB,CAL STATE
SUE   DC     ECONOMICS,MATH         HARVARD,WISCONSIN
MARY  PA     PSYCHOLOGY             YALE
JOHN  GA     HISTORY,BUSINESS ADM   CHICAGO,MIT
MARK  NY     MATH,MEDICAL SCIENCES  UCSD,UCLA,CAL STATE,COMMUNITY
but instead, I get a single column containing the concatenated schools.
It is because your np.nan values cannot be joined as strings, so pandas silently drops that column when the lambda raises. You need to convert the values to str first:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.astype(str)))
To drop the NaN values instead and keep NAME and STATE as columns:
df.groupby(['NAME', 'STATE']).agg(lambda x: ','.join(x.dropna())).reset_index()
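Putting the dropna variant together with a small sample of the data — a runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'NAME': ['BOB', 'BOB', 'SUE', 'SUE'],
    'STATE': ['CA', 'CA', 'DC', 'DC'],
    'MAJOR': ['MARKETING', np.nan, 'ECONOMICS', 'MATH'],
    'SCHOOL': ['UCLA', 'UCSB', 'HARVARD', 'WISCONSIN'],
})

# dropna removes the NaN before joining, so no 'nan' strings appear
out = (df.groupby(['NAME', 'STATE'])
         .agg(lambda x: ','.join(x.dropna()))
         .reset_index())
print(out)
```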
I have a dataset with several columns.
Now what I want is to basically calculate score based on a particular column ("name") but grouped on the "id" column.
_id fName lName age
0 ABCD Andrew Schulz
1 ABCD Andreww 23
2 DEFG John boy
3 DEFG Johnn boy 14
4 CDGH Bob TANNA 13
5 ABCD. Peter Parker 45
6 DEFGH Clark Kent 25
So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score. For example, if I run it for the column "fName", I should be able to reduce this dataframe based on a score threshold:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
from pyjarowinkler import distance

df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need someone to suggest a way to combine the similar rows in a way that takes data from similar rows
Just wanted to clear some doubts regarding your question; I couldn't ask in the comments due to low reputation.
Like here if i run it for col "fName". I should be able to reduce this
dataframe to based on a score threshold:
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would give the following result:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '', '23']
r2 = ['DEFG', 'John', 'boy', '']
r3 = ['DEFG', 'John', 'boy', '14']
r4 = ['CDGH', 'Bob', 'TANNA', '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

merged = dict()
for row in Rx:
    if row[0] in merged:
        # fill in the missing fields of the row already stored for this id
        if row[2] != '':
            merged[row[0]][2] = row[2]
        if row[3] != '':
            merged[row[0]][3] = row[3]
    else:
        merged[row[0]] = row
Rx[:] = merged.values()
print(Rx)
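The same gap-filling merge can also be expressed in pandas — a sketch assuming empty strings mark missing values, using groupby.first (which takes the first non-null value per column in each group):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '_id':   ['ABCD', 'ABCD', 'DEFG', 'DEFG', 'CDGH'],
    'fName': ['Andrew', 'Andrew', 'John', 'John', 'Bob'],
    'lName': ['Schulz', '', 'boy', 'boy', 'TANNA'],
    'age':   ['', '23', '', '14', '13'],
})

# first() skips NaN, so each group keeps its first non-missing value per column
out = (df.replace('', np.nan)
         .groupby('_id', sort=False)
         .first()
         .reset_index())
print(out)
```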
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.bfill().drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use bfill to 'back fill' the data (df.fillna(method='bfill') is deprecated in recent pandas). Then drop duplicates, keeping the first occurrence of each Id. bfill fills each gap with the next value found in the column, which may belong to another Id, but since the duplicated rows are discarded, drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
import numpy as np
import pandas as pd

data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.bfill().drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
import pandas as pd
stack = pd.DataFrame(['adam',25,28,'steve',25,28,'emily',18,21])
print(stack[0].to_list()[0::2])
print(stack[0].to_list()[1::2])
df = pd.DataFrame(
{'Name': stack[0].to_list()[0::3],
'Age': stack[0].to_list()[1::3],
'New Age': stack[0].to_list()[2::3] }
)
print(df)
How do I separate adam and steve into different rows?
I want it to line up like the table below.
[table image removed from question]
You can get the column as a list and use the slices [0::2] and [1::2]:
import pandas as pd
data = pd.DataFrame(['adam',22,'steve',25,'emily',18])
print(data)
#print(data[0].to_list()[0::2])
#print(data[0].to_list()[1::2])
df = pd.DataFrame({
'Name': data[0].to_list()[0::2],
'Age': data[0].to_list()[1::2],
})
print(df)
Before (like in the original image, which was removed from the question):
0
0 adam
1 22
2 steve
3 25
4 emily
5 18
After:
Name Age
0 adam 22
1 steve 25
2 emily 18
EDIT: BTW, the same works with a plain list:
import pandas as pd
data = ['adam',22,'steve',25,'emily',18]
print(data)
df = pd.DataFrame({
'Name': data[0::2],
'Age': data[1::2],
})
print(df)
These two lines should do it. However, without knowing what code you have, what you're trying to accomplish, or what else you intend to do with it, the following is only valid for this exact situation:
d = {'Name': ['adam', 'steve', 'emily'], 'Age': [22, 25, 18]}
df = pd.DataFrame(d)
I'm having a pandas issue.
I have a dataframe that looks like the following:
A B C D
0 max kate marie John
1 kate marie max john
2 john max kate marie
3 marie john kate max
And I need to access, for instance, the cell in row 0, column D.
I tried using df.iloc[0, 3] but it returns the whole D column.
Any help would be appreciated.
You could use
df.iloc[0]['D']
or
df.loc[0,'D']
Documentation reference DataFrame.iloc
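A quick way to see the two access styles side by side — a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': ['max', 'kate'],
                   'B': ['kate', 'marie'],
                   'C': ['marie', 'max'],
                   'D': ['John', 'john']})

# iloc is purely positional: row 0, column 3
print(df.iloc[0, 3])    # 'John'

# loc uses labels: index label 0, column label 'D'
print(df.loc[0, 'D'])   # 'John'
```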
To get the value at a location:
df.iloc[0]["D"]
seems to do the trick
It works fine if your DataFrame is named df:
df.iloc[0,3]
Out[15]: 'John'
You can refer to this for your solution
# Import pandas package
import pandas as pd
# Define a dictionary containing employee data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print(df)
Then you get a table like this.
If you want the name in the 0-th row (column "Name"), the syntax is dataframe.iloc[row_index]['ColumnName']:
print(df.iloc[0]['Name'])
This is my code:
import pandas as pd
data = [('index', 'name', 'age'), ('idx01', 'John', 23), ('idx02', 'Marc', 32), ('idx03', 'Helena', 12)]
columns = data.pop(0)
df = pd.DataFrame(data, columns=columns).set_index(columns[0])
print(df)
Which produces:
name age
index <----- Where is this row coming from?
idx01 John 23
idx02 Marc 32
idx03 Helena 12
I do not understand where the empty index row is coming from. Is this a header or a data row? Why is it being generated? It is just an empty row, but other dataframes (generated with other methods) in my code do not have it. I would like to make sure my data is not corrupted somehow.
It is not an empty row; it is the index name. When you set_index from a named column, the index takes that column's name ('index' in this sample), and pandas prints it on its own line above the data. Set df.index.name = None and the row disappears.
import pandas as pd
data = [('index', 'name', 'age'),
('idx01', 'John', 23),
('idx02', 'Marc', 32),
('idx03', 'Helena', 12)]
columns = data.pop(0)
df = pd.DataFrame(data, columns=columns).set_index(columns[0])
print(df)
name age
index
idx01 John 23
idx02 Marc 32
idx03 Helena 12
df.index.name=None
print(df)
name age
idx01 John 23
idx02 Marc 32
idx03 Helena 12
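Alternatively, rename_axis can clear the index name without mutating the dataframe in place — a sketch:

```python
import pandas as pd

data = [('idx01', 'John', 23), ('idx02', 'Marc', 32)]
df = pd.DataFrame(data, columns=['index', 'name', 'age']).set_index('index')

# rename_axis(None) returns a copy with the index name removed
df2 = df.rename_axis(None)
print(df2)
```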