This is my dataframe (where the values in the authors column are comma-separated strings):
authors book
Jim, Charles The Greatest Book in the World
Jim An OK book
Charlotte A book about books
Charlotte, Jim The last book
How do I transform it to a long format, like this:
authors book
Jim The Greatest Book in the World
Jim An OK book
Jim The last book
Charles The Greatest Book in the World
Charlotte A book about books
Charlotte The last book
I've tried extracting the individual authors to a list with authors = list(df['authors'].str.split(',')), flattening that list, matching every author to every book, and constructing a new list of dicts with every match. But that doesn't seem very pythonic to me, and I'm guessing pandas has a cleaner way to do this.
You can split the authors column by comma after setting the index to the book, which gets you almost all the way there. Rename and sort the columns to finish.
df.set_index('book').authors.str.split(',', expand=True).stack().reset_index('book')
book 0
0 The Greatest Book in the World Jim
1 The Greatest Book in the World Charles
0 An OK book Jim
0 A book about books Charlotte
0 The last book Charlotte
1 The last book Jim
And to get you all the way home:
df.set_index('book')\
.authors.str.split(',', expand=True)\
.stack()\
.reset_index('book')\
.rename(columns={0:'authors'})\
.sort_values('authors')[['authors', 'book']]\
.reset_index(drop=True)
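One nit: having split on ',' alone, the stacked values keep the leading space from ', ' (e.g. ' Charles'); a .str.strip() can be slotted into the chain. A runnable sketch of the whole pipeline:

```python
import pandas as pd

df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

long_df = (df.set_index('book')
             .authors.str.split(',', expand=True)  # one author per column
             .stack()                              # one author per row, NaNs dropped
             .str.strip()                          # drop the space left after ','
             .reset_index('book')
             .rename(columns={0: 'authors'})
             .sort_values('authors')[['authors', 'book']]
             .reset_index(drop=True))
```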
The best option is to use pandas.Series.str.split, and then pandas.DataFrame.explode to expand the resulting lists.
Split on ', ', otherwise values following the comma will be preceded by a whitespace (e.g. ' Charles')
Tested in python 3.10, pandas 1.4.3
import pandas as pd
data = {'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'], 'book': ['The Greatest Book in the World', 'An OK book', 'A book about books', 'The last book']}
df = pd.DataFrame(data)
# display(df)
authors book
0 Jim, Charles The Greatest Book in the World
1 Jim An OK book
2 Charlotte A book about books
3 Charlotte, Jim The last book
# split authors
df.authors = df.authors.str.split(', ')
# explode the column (with a fresh 0, 1... index)
df = df.explode('authors', ignore_index=True)
# display(df)
authors book
0 Jim The Greatest Book in the World
1 Charles The Greatest Book in the World
2 Jim An OK book
3 Charlotte A book about books
4 Charlotte The last book
5 Jim The last book
I am using Google Colab, and there is a folder called 'examples' containing three txt files.
I am using the following code to read them and convert them to a pandas DataFrame:
dataset_filepaths = glob.glob('examples/*.txt')
for filepath in tqdm.tqdm(dataset_filepaths):
    df = pd.read_csv(filepath)
If you print the dataset_filepaths you will see
['examples/kate_middleton.txt',
'examples/jane_doe.txt',
'examples/daniel_craig.txt']
which is correct. However, df contains only the first document. Could you please let me know how we can create a pandas DataFrame in the following form:
index text
-----------------
0 text0
1 text1
. .
. .
. .
Updated: @Steven Rumbalski, using your code
dfs = [pd.read_csv(filepath) for filepath in tqdm.tqdm(dataset_filepaths)]
dfs
The output looks like this
[Empty DataFrame
Columns: [Kate Middleton is the wife of Prince William. She is a mother of 3 children; 2 boys and a girl. Kate is educated to university level and that is where she met her future husband. Kate dresses elegantly and is often seen carrying out charity work. However, she is a mum first and foremost and the interactions we see with her children are adorable. Kate’s sister, Pippa, has followed Kate into the public eye. She was born in 1982 and will soon turn 40. When pregnant, Kate suffers from a debilitating illness called Hyperemesis Gravidarum, which was little known about until it was reported that Kate had it.]
Index: [], Empty DataFrame
Columns: [Jane Doe was born in December 1978 and is currently living in London, United Kingdom.]
Index: [], Empty DataFrame
Columns: [He is an English film actor known for playing James Bond in the 007 series of films. Since 2005, he has been playing the character but he confirmed that No Time to Die would be his last James Bond film. He was born in Chester on 2nd of March in 1968. He moved to Liverpool when his parents divorced and lived there until he was sixteen years old. He auditioned and was accepted into the National Youth Theatre and moved down to London. He studied at Guildhall School of Music and Drama. He has appeared in many films.]
Index: []]
How can I convert it in the form that I want?
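One way to get one row per file is to read each file's full text directly rather than going through read_csv (which treats the first prose line as a header row, which is why each DataFrame comes back empty). A minimal sketch, using a temporary directory as a stand-in for the 'examples' folder:

```python
import glob
import os
import tempfile

import pandas as pd

# Demo setup: a stand-in for the 'examples' folder with three txt files.
examples_dir = tempfile.mkdtemp()
for name, text in [('kate_middleton.txt', 'Kate Middleton is the wife of Prince William.'),
                   ('jane_doe.txt', 'Jane Doe was born in December 1978.'),
                   ('daniel_craig.txt', 'He is an English film actor.')]:
    with open(os.path.join(examples_dir, name), 'w') as f:
        f.write(text)

# Read each file's entire contents as one string -> one row per file.
dataset_filepaths = sorted(glob.glob(os.path.join(examples_dir, '*.txt')))
texts = []
for filepath in dataset_filepaths:
    with open(filepath) as f:
        texts.append(f.read())

df = pd.DataFrame({'text': texts})
```

The resulting frame has a fresh 0, 1, ... index and one 'text' column, matching the desired shape.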
I have a solution below that gives me a new column as a universal identifier, but what if there is additional data in the NAME column? How can I tweak the code below to account for a wildcard-like search term?
Basically, if German/german or Mexican/mexican appears anywhere in the row value, I want the new column to hold Euro or South American respectively.
df["Identifier"] = df["NAME"].str.lower().replace(
    to_replace=['german', 'mexican'],
    value=['Euro', 'South American'],
)
print(df)
NAME Identifier
0 German Euro
1 german Euro
2 Mexican South American
3 mexican South American
Desired output
NAME Identifier
0 1990 German Euro
1 german 1998 Euro
2 country Mexican South American
3 mexican city 2006 South American
Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
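A quick runnable check of that extract-and-map against the wildcard-style NAME values from the desired output:

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['1990 German', 'german 1998',
                            'country Mexican', 'mexican city 2006']})

# Pull out the first matching keyword (case-insensitive), then map it to a label.
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
```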
Another approach would be using np.where with those two conditions, but there is probably a more elegant solution.
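The np.where idea can be sketched with np.select, which takes parallel lists of conditions and choices (a minimal sketch; the 'unknown' default label is my own placeholder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME': ['1990 German', 'german 1998',
                            'country Mexican', 'mexican city 2006']})

lower = df['NAME'].str.lower()
# One boolean mask per keyword; np.select picks the first matching choice per row.
conditions = [lower.str.contains('german', regex=False),
              lower.str.contains('mexican', regex=False)]
choices = ['Euro', 'South American']
df['Identifier'] = np.select(conditions, choices, default='unknown')
```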
The code below works. I tried to do it with the apply function but couldn't quite get it working; in the meantime, here is a workable loop-based version:
df3['identifier'] = ''
js_ref = [{'german': 'Euro'}, {'mexican': 'South American'}]
for i in range(len(df3)):
    for l in js_ref:
        for k, v in l.items():
            if k.lower() in df3.name[i].lower():
                df3.loc[i, 'identifier'] = v
                break
I have a dataframe:
Name
Joe Smith
Jane Doe
Homer Simpson
I am trying to format it to get:
Name
Smith, Joe
Doe, Jane
Simpson, Homer
I have this code, and it works for ~80% of the users in my list, but some users are not coming through right.
invalid_users = ['Test User', 'Test User2', 'Test User3']
for index, row in df_Users.iterrows():
    gap_pos = df_Users["Name"][index].find(" ")
    if gap_pos > 0 and row["Name"] not in invalid_users:
        row["Name"] = df_Users["Name"][index][len(df_Users["Name"][index])-gap_pos+1:].strip() + ', ' + df_Users["Name"][index][:gap_pos]
The users who are not coming through correctly usually have their last name truncated somewhere, i.e. Simpson ==> mpson.
What am I doing wrong here?
Just split on space, then reverse it (that's what .str[::-1] is doing) and join with ', ':
>>> df['Name'].str.split(' ').str[::-1].str.join(', ')
0 Smith, Joe
1 Doe, Jane
2 Simpson, Homer
Name: Name, dtype: object
And if your data contains names like Jr. Joe Smith, then you may do it the following way:
df['Name'].str.split(' ').str[::-1].apply(lambda x:(x[0],' '.join(x[1:]))).str.join(', ')
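A quick check of that variant (after reversing, it keeps everything past the first token together as the given-name part):

```python
import pandas as pd

s = pd.Series(['Jr. Joe Smith', 'Joe Smith'])
# Reverse the tokens, take the first as surname, rejoin the rest, then comma-join.
out = s.str.split(' ').str[::-1].apply(lambda x: (x[0], ' '.join(x[1:]))).str.join(', ')
```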
I'm not sure what you were trying to do with len there, but it's not right. You just want to start straight from gap_pos:
row["Name"] = df_Users["Name"][index][gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
I would be tempted to use split for this.
Pandas is a library that takes advantage of vectorized operations, especially for simple transformations and most DataFrame manipulations.
Given your example, here is a code that would work:
import pandas as pd
df = pd.DataFrame({"name": ["Joe Smith", "Jane Doe", "Homer Simpson"]})
# df
# name
# 0 Joe Smith
# 1 Jane Doe
# 2 Homer Simpson
df["name"] = df["name"].apply(lambda x: f"{x.split(' ')[1]}, {x.split(' ')[0]}")
# df
# name
# 0 Smith, Joe
# 1 Doe, Jane
# 2 Simpson, Homer
The apply function applies the specified function to every value of the Series.
Here, the specified function is a lambda that, assuming the name pattern is "FirstName LastName", does what you want.
I am trying to import a dataset from a text file, which looks like this.
id book author
1 Cricket World Cup: The Indian Challenge Ashis Ray
2 My Journey Dr. A.P.J. Abdul Kalam
3 Making of New India Dr. Bibek Debroy
4 Whispers of Time Dr. Krishna Saksena
When I import it using:
df = pd.read_csv('book.txt', sep=' ')
the rows are split on every single space, giving far too many columns, and when I use:
df = pd.read_csv('book.txt')
each whole line ends up in a single column.
Is there a way to get the three columns id, book and author parsed correctly, as in the table above?
Any help on this will be appreciated. Thank you
Try with tab as a separator:
df = pd.read_csv('book.txt', sep='\t')
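A self-contained check with an in-memory stand-in for book.txt (assuming the columns really are separated by tabs):

```python
import io

import pandas as pd

# Two sample rows in the same tab-separated layout as the question's file.
data = ('id\tbook\tauthor\n'
        '1\tCricket World Cup: The Indian Challenge\tAshis Ray\n'
        '2\tMy Journey\tDr. A.P.J. Abdul Kalam\n')

df = pd.read_csv(io.StringIO(data), sep='\t')
```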
So I have three pandas dataframes (train_original, train_augmented, test). Overall it is about 700k lines. I would like to remove all cities in a list, common_cities, from them, but tqdm in the notebook cell suggests that it would take about 24 hrs to replace everything from a list of 33,000 cities.
dataframe example (train_original):
id  name_1                            name_2
0   sun blinds decoration paris inc.  indl de cuautitlan sa cv
1   eih ltd. dongguan wei shi         plastic new york product co., ltd.
2   jsh ltd. (hk) mexico city         arab shipbuilding seoul and repair yard madrid c
common_cities list example
common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']
what the output is supposed to be:
id  name_1                      name_2
0   sun blinds decoration inc.  indl de sa cv
1   eih ltd. wei shi            plastic product co., ltd.
2   jsh ltd. (hk)               arab shipbuilding and repair yard c
My solution worked well with a small list of filter words, but when the list is large, performance is poor.
%%time
for city in tqdm(common_cities):
    train_original.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    train_augmented.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    test.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
P.S.: I presume it's not great to use a list comprehension that splits the string and substitutes city names token by token, because a city name can be more than one word.
Any suggestions, ideas on approach to make a quick replacement on Pandas Dataframes in such situations?
Instead of iterating over the huge dfs once per city, remember that pandas replace accepts a dictionary with all the replacements to be done in a single pass. One caveat: since the cities appear inside longer strings, the dictionary keys need to be regex patterns and regex=True must be set (a plain value dictionary only matches whole cell values).
Therefore we can start by creating the dictionary and then using it with replace:
replacements = {fr'\b{city}\b': '' for city in common_cities}
train_original = train_original.replace(replacements, regex=True)
train_augmented = train_augmented.replace(replacements, regex=True)
test = test.replace(replacements, regex=True)
Edit: Reading the documentation, it might be even easier, because replace also accepts a list of patterns to be replaced with a single value:
patterns = [fr'\b{city}\b' for city in common_cities]
train_original = train_original.replace(patterns, '', regex=True)
train_augmented = train_augmented.replace(patterns, '', regex=True)
test = test.replace(patterns, '', regex=True)
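A runnable sketch of the single-pass idea on a couple of the sample rows. Note regex=True and the \b word boundaries: without regex=True, replace only matches whole cell values, and the cities here sit inside longer strings:

```python
import pandas as pd

train_original = pd.DataFrame({
    'name_1': ['sun blinds decoration paris inc.', 'jsh ltd. (hk) mexico city'],
    'name_2': ['indl de cuautitlan sa cv', 'arab shipbuilding seoul and repair yard madrid c'],
})
common_cities = ['moscow', 'madrid', 'paris', 'mexico city', 'seoul']

# One dictionary of regex patterns -> replacement, applied in a single pass.
replacements = {fr'\b{city}\b': '' for city in common_cities}
train_original = train_original.replace(replacements, regex=True)
```

Replacing with '' leaves doubled spaces behind; a follow-up train_original.replace(r'\s{2,}', ' ', regex=True) plus a strip cleans those up if the exact expected spacing matters.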