Splitting series of Strings into Dataframe - python

I have a big series of strings that I want to split into a dataframe.
The series looks like this:
s = pd.Series({"1":"Name=Marc-Age=48-Car=Tesla",
"2":"Name=Ben-Job=Pilot-Car=Porsche",
"3":"Name=Tom-Age=24-Car=Ford"})
I want to split this into a dataframe looking like this:
   Name  Age    Job      Car
1  Marc   48    NaN    Tesla
2   Ben  NaN  Pilot  Porsche
3   Tom   24    NaN     Ford
I tried to split the strings first by "-" and then by "=", but I don't understand how to continue after this:
df = s.str.split("-", expand=True)
for col in df.columns:
    df[col] = df[col].str.split("=")
I get this:
                  0                 1                   2
1  ['Name', 'Marc']     ['Age', '48']    ['Car', 'Tesla']
2   ['Name', 'Ben']  ['Job', 'Pilot']  ['Car', 'Porsche']
3   ['Name', 'Tom']     ['Age', '24']     ['Car', 'Ford']
I don't know how to continue from here. I can't loop through the rows because my dataset is really big.
Can anyone help on how to go on from here?

If you split, then explode, and split again, you can then use a pivot:
import pandas as pd

s = pd.Series({"1":"Name=Marc-Age=48-Car=Tesla",
               "2":"Name=Ben-Job=Pilot-Car=Porsche",
               "3":"Name=Tom-Age=24-Car=Ford"})

# One "key=value" token per row, then split each token into two columns.
s = s.str.split('-').explode().str.split('=', expand=True).reset_index()
# Pivot the keys into column headers, keyed by the original row label.
s = s.pivot(index='index', columns=0, values=1).reset_index(drop=True)
Output
   Age      Car    Job  Name
0   48    Tesla    NaN  Marc
1  NaN  Porsche  Pilot   Ben
2   24     Ford    NaN   Tom
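An alternative sketch, in case the pivot feels indirect: parse each row into a dict and build the frame directly. This assumes every token contains exactly one "=" and each key appears at most once per row.
import pandas as pd

s = pd.Series({"1": "Name=Marc-Age=48-Car=Tesla",
               "2": "Name=Ben-Job=Pilot-Car=Porsche",
               "3": "Name=Tom-Age=24-Car=Ford"})

# One dict per row, e.g. {'Name': 'Marc', 'Age': '48', 'Car': 'Tesla'};
# pandas unions the keys and fills missing ones with NaN.
records = [dict(pair.split("=") for pair in row.split("-")) for row in s]
df = pd.DataFrame(records, index=s.index)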

Related

How to search a list of strings in a data frame column and return the matching string as an adjacent column

What I have
I have a column 'Student' with students' names and their personalities.
I have a list named 'qualities' that consists of the qualities required for filtering.
What I want
I want a column next to 'Student' that returns the matching string from the list.
What I have
import pandas as pd
Personality = {'Student':["Aysha is clever", "Ben is stronger", "Cathy is clever and strong", "Dany is intelligent", "Ella is naughty", "Fred is quieter"]}
index_labels=['1','2','3','4','5','6']
df = pd.DataFrame(Personality,index=index_labels)
qualities = ['calm', 'clever', 'quiet', 'bold', 'strong', 'cute']
What I want
Output
Use str.findall and then split by ",":
df['ex'] = df['Student'].str.findall('|'.join(qualities)).apply(set).str.join(', ')
new = df["ex"].str.split(pat=",", expand=True)[1]
df = pd.concat([df, new], axis=1)
df = df.fillna('')
print(df)
Gives:
                      Student              ex       1
1             Aysha is clever          clever
2             Ben is stronger          strong
3  Cathy is clever and strong  clever, strong  strong
4         Dany is intelligent
5             Ella is naughty
6             Fred is quieter           quiet
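One hedged tweak to the above: apply(set) makes the order of the joined matches arbitrary, since sets are unordered. dict.fromkeys dedupes while keeping the order in which the qualities appear in the sentence:
df['ex'] = (df['Student'].str.findall('|'.join(qualities))
              .apply(lambda m: ', '.join(dict.fromkeys(m))))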
You can use str.extractall and unstack, then join to the original DataFrame:
import re
pattern = '|'.join(map(re.escape, qualities))
out = df.join(df['Student'].str.extractall(f'({pattern})')[0].unstack())
Output:
                      Student       0       1
1             Aysha is clever  clever     NaN
2             Ben is stronger  strong     NaN
3  Cathy is clever and strong  clever  strong
4         Dany is intelligent     NaN     NaN
5             Ella is naughty     NaN     NaN
6             Fred is quieter   quiet     NaN
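If a single comma-separated column is preferred over the spread 0, 1, ... columns, the extractall result can be grouped back onto the original index. A small sketch reusing the pattern above (the column name 'qualities_found' is just illustrative):
matches = df['Student'].str.extractall(f'({pattern})')[0]
df['qualities_found'] = matches.groupby(level=0).agg(', '.join)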

How to collapse all rows in pandas dataframe across all columns

I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
   name       job  value
0   bob  business    100
1   NaN   dentist    NaN
2  jack       NaN    NaN
I am trying to get the following output:
       name               job  value
0  bob jack  business dentist    100
I am trying to group across all columns; I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0
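In both outputs the value shows as 100.0 because the NaNs force the numeric column to float before the string conversion. A hedged fix, assuming a pandas version with nullable integers (0.24+), is to convert that column first:
df['value'] = df['value'].astype('Int64')  # nullable integer: keeps 100 as 100, not 100.0
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T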

Reading nested json to pandas dataframe

I have the URL below, which returns a JSON response. I need to read this JSON into a pandas dataframe and perform operations on top of it. This is a case of nested JSON, consisting of multiple lists and dicts within dicts.
URL: 'http://api.nobelprize.org/v1/laureate.json'
I have tried the code below:
import json, pandas as pd, requests
resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
print(df.head(5))
Output-
id firstname surname born died \
0 1 Wilhelm Conrad Röntgen 1845-03-27 1923-02-10
1 2 Hendrik A. Lorentz 1853-07-18 1928-02-04
2 3 Pieter Zeeman 1865-05-25 1943-10-09
3 4 Henri Becquerel 1852-12-15 1908-08-25
4 5 Pierre Curie 1859-05-15 1906-04-19
bornCountry bornCountryCode bornCity \
0 Prussia (now Germany) DE Lennep (now Remscheid)
1 the Netherlands NL Arnhem
2 the Netherlands NL Zonnemaire
3 France FR Paris
4 France FR Paris
diedCountry diedCountryCode diedCity gender \
0 Germany DE Munich male
1 the Netherlands NL NaN male
2 the Netherlands NL Amsterdam male
3 France FR NaN male
4 France FR Paris male
prizes
0 [{'year': '1901', 'category': 'physics', 'shar...
1 [{'year': '1902', 'category': 'physics', 'shar...
2 [{'year': '1902', 'category': 'physics', 'shar...
3 [{'year': '1903', 'category': 'physics', 'shar...
4 [{'year': '1903', 'category': 'physics', 'shar...
But here prizes comes back as a list. If I create a separate dataframe for prizes, it has affiliations as a list. I want everything as separate columns. Some entries may or may not have prizes, so that case needs handling as well.
I went through this article: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd. It looks like we'll have to use meta and errors='ignore' here, but I'm not able to fix it. I'd appreciate your inputs. Thanks.
You would have to do this in a few steps:
First, extract the outer records with record_path=['laureates'].
Second, extract the nested records with record_path=['laureates', 'prizes'], using meta to carry the id of the parent record.
Then combine the two datasets by joining on the id column.
Finally, drop the unnecessary columns and store the result.
import json, pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
data = json.loads(resp.content)

# One row per laureate.
df0 = pd.json_normalize(data, record_path=['laureates'])
# One row per prize, carrying the parent laureate's id along as meta.
df1 = pd.json_normalize(data, record_path=['laureates', 'prizes'],
                        meta=[['laureates', 'id']])

output = (pd.merge(df0, df1, left_on='id', right_on='laureates.id')
            .drop(['prizes', 'laureates.id'], axis=1))
print('Shape of data ->', output.shape)
print('Columns ->', output.columns)
Shape of data -> (975, 18)
Columns -> Index(['id', 'firstname', 'surname', 'born', 'died', 'bornCountry',
'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode',
'diedCity', 'gender', 'year', 'category', 'share', 'motivation',
'affiliations', 'overallMotivation'],
dtype='object')
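One caveat, tied to the question's "some entries may or may not have prizes": the default inner merge drops any laureate without a prizes record. If those rows should be kept, a left join is a small change (a sketch of the same merge):
output = (pd.merge(df0, df1, left_on='id', right_on='laureates.id', how='left')
            .drop(['prizes', 'laureates.id'], axis=1))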
Found an alternate solution as well, with less code. This works.
from flatten_json import flatten
import pandas as pd, requests

winners = requests.get('http://api.nobelprize.org/v1/laureate.json').json()
data = winners['laureates']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df.shape)
(968, 43)
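Worth knowing about the flatten_json route: list entries are flattened positionally, so nested fields come out as columns like 'prizes.0.year' and 'prizes.0.category', and a laureate with several prizes widens the frame instead of adding rows. A quick way to inspect those columns (a sketch, not part of the original answer):
prize_cols = [c for c in df.columns if c.startswith('prizes.')]
print(prize_cols[:5])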

Python Pandas concatenate every 2nd row to previous row

I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one, adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift() but it was duplicating every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values    # even rows keep their original column
    d[ncol] = df[col].iloc[1::2].values  # odd rows become the new column
pd.DataFrame(d)
Result:
  age  colour  name  language     sex  other
0  30    blue   jon       php    male    NaN
1  18  orange  jane       c++  female    NaN
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True),
                pd.DataFrame(df.iloc[1::2].values,
                             columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values and create a new dataframe:
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
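A hedged caveat on the reshape approach: it assumes the row count is even and every odd row really is the continuation of the row above it; reshape raises a ValueError otherwise, so a cheap guard helps:
assert len(df) % 2 == 0, "rows are expected to pair up evenly"
out = pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
                   columns=[*df.columns, 'colour', 'language', 'other'])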

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns.
What I want is to calculate a score based on a particular column ("name"), grouped on the "id" column.
     _id  fName    lName   age
0   ABCD  Andrew   Schulz
1   ABCD  Andreww           23
2   DEFG  John     boy
3   DEFG  Johnn    boy      14
4   CDGH  Bob      TANNA    13
5  ABCD.  Peter    Parker   45
6  DEFGH  Clark    Kent     25
So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score. For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
    _id  fName   lName   age
0  ABCD  Andrew  Schulz   23
2  DEFG  John    boy      14
4  CDGH  Bob     TANNA    13
5  ABCD  Peter   Parker   45
6  DEFG  Clark   Kent     25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is what matches looks like:
               fName
index1 index2
0      1         1.0
2      3         1.0
I need someone to suggest a way to combine the similar rows so that the merged row takes data from each of them.
Just wanted to clear some doubts regarding your question; I couldn't ask them in comments due to low reputation.
"For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:"
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would result in the following DataFrame:
    _id  fName   lName   age
0  ABCD  Andrew  Schulz   23
2  DEFG  John    boy      14
4  CDGH  Bob     TANNA    13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '',       '23']
r2 = ['DEFG', 'John',   'boy',    '']
r3 = ['DEFG', 'John',   'boy',    '14']
r4 = ['CDGH', 'Bob',    'TANNA',  '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

# Keep one record per id, filling in lName/age from later duplicates.
Dict = dict()
for i in Rx:
    if i[0] in Dict:
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i

Rx[:] = Dict.values()
print(Rx)
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates, keeping the first occurrence of each Id. fillna will fill values from the next value found in the column, which may belong to another Id, but since you will discard the duplicated rows, drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
import numpy as np
import pandas as pd

data = [
    ['AABBCC', 'Andrew', ''],
    ['AABBCC', 'Andrew', 'Schulz'],
    ['AABBCC', 'Andrew', '', 23],
    ['AABBCC', 'Andrew', ''],
    ['AABBCC', 'Andrew', ''],
    ['DDEEFF', 'Karl', 'boy'],
    ['DDEEFF', 'Karl', ''],
    ['DDEEFF', 'Karl', '', 14],
    ['GGHHHH', 'John', 'TANNA', 13],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', ''],
    ['HLHLHL', 'Bob', 'Blob'],
    ['HLHLHL', 'Bob', 'Blob', 15],
    ['HLHLHL', 'Bob', '', 15],
    ['JLJLJL', 'Nick', 'Best', 20],
    ['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
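A variant that cannot leak values across Ids (bfill on the whole frame can pull a value from the next Id when a column is entirely empty for a group): groupby(...).first() takes the first non-null value per column within each Id, which gives the same fill-then-dedupe semantics in one step.
df_filled = df.groupby('Id', as_index=False, sort=False).first()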
Hope this helps and apologies if I misunderstood the question.
