How to get all the unique words in the data frame? - python

I have a dataframe with a list of products and their respective reviews:
+-----------+-------------------------------------------------+
| product   | review                                          |
+-----------+-------------------------------------------------+
| product_a | It's good for a casual lunch                    |
+-----------+-------------------------------------------------+
| product_b | Avery is one of the most knowledgeable baristas |
+-----------+-------------------------------------------------+
| product_c | The tour guide told us the secrets              |
+-----------+-------------------------------------------------+
How can I get all the unique words in the data frame?
I made a function:
from collections import Counter

def count_words(text):
    try:
        text = text.lower()
        words = text.split()
        word_counts = Counter(words)
    except AttributeError:
        # non-string values (e.g. NaN) have no .lower()
        word_counts = Counter()
    return word_counts
I applied the function to the DataFrame, but that only gives me the word counts for each row:
reviews['words_count'] = reviews['review'].apply(count_words)
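To merge these per-row counts into one vocabulary for the whole column, the Counters can simply be added up. A minimal sketch building on the count_words function above:

total = Counter()
for row_counts in reviews['review'].apply(count_words):
    total.update(row_counts)   # update() adds the counts together

unique_words = sorted(total)   # the keys of the merged Counter are the unique words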

Starting with this:
dfx
review
0 United Kingdom
1 The United Kingdom
2 Dublin, Ireland
3 Mardan, Pakistan
To get all the unique words in the "review" column:
list(dfx['review'].str.split(' ', expand=True).stack().unique())
['United', 'Kingdom', 'The', 'Dublin,', 'Ireland', 'Mardan,', 'Pakistan']
To get the word counts for the "review" column:
dfx['review'].str.split(' ', expand=True).stack().value_counts()
United 2
Kingdom 2
Mardan, 1
The 1
Ireland 1
Dublin, 1
Pakistan 1
dtype: int64
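If you're on pandas 0.25 or newer, Series.explode does the same job; a sketch of the equivalent calls (note that .str.split() with no argument splits on any whitespace):

words = dfx['review'].str.split()              # one list of words per row
unique_words = list(words.explode().unique())  # all unique words
counts = words.explode().value_counts()        # counts per word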

Related

Best way to remove specific words from column in pandas dataframe?

I'm working with a huge set of data that I can't work with in Excel, so I'm using pandas/Python, but I'm relatively new to it. I have a column of book titles that also includes genres, both before and after the title. I only want the column to contain book titles, so what would be the easiest way to remove the genres?
Here is an example of what the column contains:
Book Labels
Science Fiction | Drama | Dune
Thriller | Mystery | The Day I Died
Thriller | Razorblade Tears | Family | Drama
Comedy | How To Marry Keanu Reeves In 90 Days | Drama
...
So above, the book titles would be Dune, The Day I Died, Razorblade Tears, and How To Marry Keanu Reeves In 90 Days, but as you can see the genres both precede and follow the titles.
I was thinking I could create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters, but if anyone has suggestions on a simpler way to remove the genres and the "|" characters, please help me out.
This is an enhancement to @tdy's regex solution. The original regex Family|Drama matches the words "Family" and "Drama" anywhere in the string, so if a book title itself contains a genre word, that word gets removed as well.
Assuming the labels are separated by " | ", there are three match positions we want to remove:
Genre at the start of the string, e.g. Drama | ...
Genre in the middle, e.g. ... | Drama | ...
Genre at the end of the string, e.g. ... | Drama
Use the regex (^|\| )(?:Family|Drama)(?=( \||$)) to match any of the three conditions. Note that | Drama | Family contains two overlapping matches; the lookahead (?=( \||$)) leaves the trailing delimiter unconsumed so that both can match. See [Use regular expressions to replace overlapping subpatterns] for more details.
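For a self-contained run, here is a hypothetical frame reproducing the rows printed below:
>>> import pandas as pd
>>> df = pd.DataFrame({"Book Labels": ["Drama | Drama 123 | Family",
...                                    "Drama 123 | Drama | Family",
...                                    "Drama | Family | Drama 123",
...                                    "123 Drama 123 | Family | Drama",
...                                    "Drama | Family | 123 Drama"]})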
>>> genres = ["Family", "Drama"]
>>> df
# Book Labels
# 0 Drama | Drama 123 | Family
# 1 Drama 123 | Drama | Family
# 2 Drama | Family | Drama 123
# 3 123 Drama 123 | Family | Drama
# 4 Drama | Family | 123 Drama
>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))
>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)
# 0 | Drama 123
# 1 Drama 123
# 2 | Drama 123
# 3 123 Drama 123
# 4 | 123 Drama
>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")
# 0 Drama 123
# 1 Drama 123
# 2 Drama 123
# 3 123 Drama 123
# 4 123 Drama
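As a check of my own, applying the same two steps to the labels from the question (with the full genre list) leaves only the titles:
>>> s = pd.Series(["Science Fiction | Drama | Dune",
...                "Thriller | Mystery | The Day I Died"])
>>> genres = ["Science Fiction", "Drama", "Thriller", "Mystery"]
>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))
>>> s.str.replace(re_str, "", regex=True).str.strip("| ")
# 0              Dune
# 1    The Day I Died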

How can I split a text between commas and make each new part a new column?

So, I have a pandas dataframe with foods and cuisines some people like. I have to break those into columns, so each food or cuisine should become a column. Each food/cuisine comes after a comma, but if I break my string only by commas, I'll lose the content inside the parentheses, which should stay there, next to the dish. I think I should use '),' as a separator, right? But I don't know how to do that. This is my DF:
>>> PD_FOODS
USER_ID | FOODS_I_LIKE |
_______________________________________________________________________________
0 100 | Pizza(without garlic, tomatos and onion),pasta |
1 101 | Seafood,veggies |
2 102 | Indian food (no pepper, no curry),mexican food(no pepper) |
3 103 | Texmex, african food, japanese food,italian food |
4 104 | Seafood(no shrimps, no lobster),italian food(no gluten, no milk)|
Is it possible to get a result like the one below?
>>> PD_FOODS
USER_ID | FOODS_I_LIKE_1 | FOODS_I_LIKE_2 |
_______________________________________________________________________________
0 100 | Pizza(without garlic, tomatos and onion)| pasta |
Thank you!
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame({"User_ID": [1000, 1001, 1002, 1003, 1004],
                   "FOODS_I_LIKE": ['Pizza(without garlic, tomatos and onion),pasta',
                                    'Seafood,veggies',
                                    'Indian food (no pepper, no curry),mexican food(no pepper)',
                                    'Texmex, african food, japanese food,italian food',
                                    'Seafood(no shrimps, no lobster),italian food(no gluten, no milk)']})

def my_func(my_string, item_num):
    try:
        if ')' in my_string:
            # split on '),' so commas inside parentheses survive, then
            # re-attach the ')' the split consumed (all but the last part)
            parts = my_string.split('),')
            parts = [p + ')' for p in parts[:-1]] + parts[-1:]
            return parts[item_num]
        else:
            return my_string.split(',')[item_num]
    except IndexError:
        return np.nan

for k in range(0, 4):
    K = str(k + 1)
    df[f'FOODS_I_LIKE_{K}'] = df.FOODS_I_LIKE.apply(lambda x: my_func(x, k))

df = df.drop(columns='FOODS_I_LIKE')
Output:
   User_ID                            FOODS_I_LIKE_1                    FOODS_I_LIKE_2  FOODS_I_LIKE_3  FOODS_I_LIKE_4
0     1000  Pizza(without garlic, tomatos and onion)                             pasta             NaN             NaN
1     1001                                   Seafood                           veggies             NaN             NaN
2     1002         Indian food (no pepper, no curry)           mexican food(no pepper)             NaN             NaN
3     1003                                    Texmex                      african food   japanese food    italian food
4     1004           Seafood(no shrimps, no lobster)  italian food(no gluten, no milk)             NaN             NaN
You could use a regex with a negative lookahead:
(df['FOODS_I_LIKE'].str.split(r',\s*(?![^()]*\))', expand=True)
.rename(columns=lambda x: int(x)+1)
.add_prefix('FOODS_I_LIKE_')
)
output:
FOODS_I_LIKE_1 FOODS_I_LIKE_2 FOODS_I_LIKE_3 FOODS_I_LIKE_4
0 Pizza(without garlic, tomatos and onion) pasta None None
1 Seafood veggies None None
2 Indian food (no pepper, no curry) mexican food(no pepper) None None
3 Texmex african food japanese food italian food
4 Seafood(no shrimps, no lobster) italian food(no gluten, no milk) None None
NB: this won't work with nested parentheses; for those you'd need a parser, e.g. the sketch below.
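A minimal hand-rolled splitter of my own that only breaks at top-level commas, so it also copes with nesting:

def split_top_level(s, sep=','):
    # Split s on sep, ignoring separators inside (possibly nested) parentheses
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    parts.append(s[start:].strip())
    return parts

split_top_level('Pizza(without garlic, tomatos and onion),pasta')
# ['Pizza(without garlic, tomatos and onion)', 'pasta']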

Split lists with uncertain elements into different categories (using pandas)

I am having trouble with a pandas split. So I have a column of data that looks something like this:
Initial Dataframe
index | Address
0 | [123 New York St]
1 | [Amazing Building, 23 New Jersey St, 2F]
2 | [98 New Mexico Ave, 16F]
3 | [White House, 1600 Pennsylvania Ave, PH]
4 | [221 Baker Street]
5 | [Hogwarts]
As you can see, the lists contain varying categories and numbers of elements. Some have building names along with addresses; some only have addresses with floor numbers. I want to sort them out by category (building name, address, unit/floor number), but I'm having trouble coming up with a solution, as I'm a beginner with Python and pandas.
How do I split the addresses into different categories to get the output below, assuming the building names ALL start with a letter, and put Null for categories with missing values?
Desired Output:
index | Building Name | Address | Unit Number
0 | Null | 123 New York St | Null
1 | Amazing Building | 23 New Jersery St. | 2F
2 | Null | 98 New Mexico Ave. | 16F
3 | White House | 1600 Pennsylvania Ave | PH
4 | Null | 221B Baker St | Null
5 | Hogwarts | Null | Null
The main thing I need is for all addresses to be in the Address Column. Thanks for any help!
Precondition: the building name starts with a letter, not a number.
If a building name starts with a number, this approach can produce the wrong result.
import pandas as pd

df = pd.DataFrame({'addr': ['123 New York St',
                            'Amazing Building, 23 New Jersey St, 2F',
                            '98 New Mexico Ave, 16F']})

# Check the number of items in the address value
df['addr'] = df['addr'].str.split(',')
df['cnt'] = df['addr'].apply(len)

# Check whether a string starts with a digit
def CheckInt(s):
    try:
        int(s[0])
        return True
    except ValueError:
        return False

for i, v in df.iterrows():
    # One item in the address value
    if v.cnt == 1:
        df.loc[i, 'Address'] = v.addr[0]
    # Three items in the address value
    elif v.cnt == 3:
        df.loc[i, 'Building'] = v.addr[0]
        df.loc[i, 'Address'] = v.addr[1]
        df.loc[i, 'Unit'] = v.addr[2]
    # Two items in the address value
    else:
        if CheckInt(v.addr[0]):
            df.loc[i, 'Address'] = v.addr[0]
            df.loc[i, 'Unit'] = v.addr[1]
        else:
            df.loc[i, 'Building'] = v.addr[0]
            df.loc[i, 'Address'] = v.addr[1]
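As an alternative to the iterrows loop, the same rules can be written as a single apply; a sketch of my own (run it on the frame right after the split/cnt step, instead of the loop, since join would clash with columns the loop already added):

def classify(parts):
    # parts is the list produced by str.split(',') above
    parts = [p.strip() for p in parts]
    if len(parts) == 3:
        return pd.Series(parts, index=['Building', 'Address', 'Unit'])
    if len(parts) == 2:
        if parts[0][:1].isdigit():   # leading digit -> address + unit
            return pd.Series(parts, index=['Address', 'Unit'])
        return pd.Series(parts, index=['Building', 'Address'])
    return pd.Series(parts, index=['Address'])

df = df.join(df['addr'].apply(classify))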
We can get the output for your input dataframe as below.
If the data is different, you may have to tinker around.
import numpy as np
import pandas as pd

df['com_Address'] = df[' Address'].apply(lambda x: x.replace('[', '').replace(']', '')).str.split(',')
st_list = ['St', 'Ave']
df['St_Address'] = df.apply(lambda x: [a if st in a else '' for st in st_list for a in x['com_Address']], axis=1)
df['St_Address'] = df['St_Address'].apply(lambda x: [i for i in x if i]).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: [x['com_Address'][0] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: np.where((len(x['com_Address']) == 1) & (x['St_Address'] == ''), x['com_Address'][0], x['Building Name']), axis=1)
df['Unit Number'] = df.apply(lambda x: [x['com_Address'][2] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Unit Number'] = df.apply(lambda x: np.where((len(x['com_Address']) == 2) & (x['St_Address'] != ''), x['com_Address'][-1], x['Unit Number']), axis=1)
df
Column "com_Address" is optional. I had to create it because the 'Address' from your input came to me as a string & not as a list. If you already have it as list, you don't need this & you will have to update "com_Address" with 'Address' in the code.
Output
index Address com_Address Building Name St_Address Unit Number
0 0 [123 New York St] [ 123 New York St] Null 123 New York St Null
1 1 [Amazing Building, 23 New Jersey St, 2F] [ Amazing Building, 23 New Jersey St, 2F] Amazing Building 23 New Jersey St 2F
2 2 [98 New Mexico Ave, 16F] [ 98 New Mexico Ave, 16F] Null 98 New Mexico Ave 16F
3 3 [White House, 1600 Pennsylvania Ave, PH] [ White House, 1600 Pennsylvania Ave, PH] White House 1600 Pennsylvania Ave PH
4 4 [221 Baker Street] [ 221 Baker Street] Null 221 Baker Street Null
5 5 [Hogwarts] [ Hogwarts] Hogwarts Null

printing multiple sections of text between two markers in python

I converted this page (it's squad lists for different sports teams) from PDF to text using this code:
import PyPDF3
import sys
import tabula
import pandas as pd

# One method
pdfFileObj = open(sys.argv[1], 'rb')
pdfReader = PyPDF3.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
print(text)
The output looks like this:
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF
I wanted to transform this output to a tab delimited file with three columns: team name, player name, and number. So for the example I gave, the output would look like:
Bohemians James Talbot 1
Bohemians Derek Pender 2
Bohemians Darragh Leahy 3
Cork City Mark McNulty 1
Cork City Colm Horgan 2
Cork City Alan Bennett 3
Derry City Peter Cherrie 1
Derry City Conor McDermott 2
Derry City Ciaran Coll 3
I know I need to (1) divide the file into sections based on team, and then (2) within each team section, combine the name and number fields into pairs so that each number is assigned to a name.
I wrote this little bit of code to parse the big file into each sports team:
import sys
fileopen = open(sys.argv[1])
recording = False
for line in fileopen:
if not recording:
if line.startswith('PREMI'):
recording = True
elif line.startswith('2019 SEA'):
recording = False
else:
print(line)
But I'm stuck, because the above code won't divide the text up into a block per team (i.e. I need multiple blocks of text extracted into separate strings or lists?). Can someone advise how to divide the text file up per team (in this example, I should be left with three blocks of text), so I can then work on each team's block to pair numbers with names?
This isn't necessarily true to form, and it doesn't take the other libraries you used into consideration, but it was designed to give you a start. You can reformat it however you wish.
>>> string = '''2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF'''
>>> def reorder(string):
...     import re
...     headers = ['Team', 'Name', 'Number']
...     print('\n')
...     print(headers)
...     print()
...     paragraphs = re.findall(r'2019[\S\s]+?(?=2019|$)', string)
...     for paragraph in paragraphs:
...         club = re.findall(r'(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
...         names_numbers = re.findall(r'(?i)([\d]+)[\n]{1,3}[\s]*([\S ]+)', paragraph)
...         for i in range(len(names_numbers)):
...             if len(club) == 1:
...                 print(club[0] + ' | ' + names_numbers[i][1] + ' | ' + names_numbers[i][0])
>>> reorder(string)
['Team', 'Name', 'Number']
BOHEMIANS | James Talbot | 1
BOHEMIANS | Derek Pender | 2
BOHEMIANS | Darragh Leahy | 3
CORK CITY | Mark McNulty | 1
CORK CITY | Colm Horgan | 2
CORK CITY | Alan Bennett | 3
DERRY CITY | Peter Cherrie | 1
DERRY CITY | Conor McDermott | 2
DERRY CITY | Ciaran Coll | 3
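Since the goal was a tab-delimited file, the same regexes can feed csv.writer directly. A sketch of my own (the squads.tsv filename is just a placeholder):

import csv
import re

def write_tsv(string, path='squads.tsv'):
    # Reuse the regexes from reorder() above to emit Team/Name/Number rows
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        for paragraph in re.findall(r'2019[\S\s]+?(?=2019|$)', string):
            club = re.findall(r'(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
            names_numbers = re.findall(r'(?i)([\d]+)[\n]{1,3}[\s]*([\S ]+)', paragraph)
            for number, name in names_numbers:
                if len(club) == 1:
                    writer.writerow([club[0], name, number])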

Splitting multiple pipe delimited values in multiple columns of a comma delimited CSV and mapping them to each other

I have a comma-delimited CSV in which some columns hold multiple pipe-delimited values. I need to map the multiple values in one column to the multiple values in another, then give each pair its own row along with the data in the original row that doesn't have multiple values. My CSV looks like this (with commas between the categories):
row name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
I need it to look like this:
row name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
Maybe not the nicest, but a working solution (it handles rows without pipes and rows with different pipe lengths):
import pandas as pd

df = pd.read_csv('<your_data>.csv')
str_split = ' | '

# Calculate the maximum length of piped (' | ') values
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x[0].split(str_split)),
                                                         len(x[1].split(str_split))), axis=1)
max_len = df['max_len'].max()

# Split '|'-piped cell values into columns (needed at the unpivot step)
# Create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(
    lambda x: pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(
    lambda x: pd.Series(x.split(str_split)))

# Unpivot 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])

# Rename unpivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value': 'name'})
df_pv_city = df_pv_city.rename(columns={'value': 'city'})

# Rename 'city_<x>' values (rows) to act as the key for the join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map({'city_{}'.format(i): 'name_{}'.format(i)
                                                     for i in range(max_len)})

# Join the unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])

# Drop the 'variable' column, and drop NULL rows left over from unequal pipe lengths
# (replace how='all' with how='any' if you want to drop any row containing a NULL)
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
                                                  axis=0).reset_index(drop=True)
The result is:
amount name city
0 10 frank toronto
1 20 george fresno
2 10 john new york
3 20 joe kansas city
4 10 dave anaheim
5 20 fred reno
Another test input:
name city amount
0 frank | john | dave | joe | bill toronto | new york | anaheim | los angeles | caracas 10
1 george | joe | fred fresno | kansas city 20
2 danny miami 30
Result of this test (if you don't want NaN rows, replace how='all' with how='any' in the dropna step):
amount name city
0 10 frank toronto
1 20 george fresno
2 30 danny miami
3 10 john new york
4 20 joe kansas city
5 10 dave anaheim
6 20 fred NaN
7 10 joe los angeles
8 10 bill caracas
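For what it's worth, on pandas 1.3 or newer the whole reshape collapses to a multi-column explode. A sketch of my own, assuming each row has equally many names and cities:

out = (df.assign(name=df['name'].str.split(r'\s*\|\s*'),
                 city=df['city'].str.split(r'\s*\|\s*'))
         .explode(['name', 'city'])   # exploding a list of columns needs pandas >= 1.3
         .reset_index(drop=True))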
Given a row:
['1','frank|joe|dave', 'toronto|new york|anaheim', '20']
you can use
itertools.zip_longest(*[value.split('|') for value in row])
(izip_longest on Python 2) to obtain the following structure:
[('1', 'frank', 'toronto', '20'),
(None, 'joe', 'new york', None),
(None, 'dave', 'anaheim', None)]
Here we want to replace each None with the last seen value in the corresponding column, which can be done while looping over the result.
So, given a TSV already split on tabs, the following code should do the trick:
import itertools

def flatten_tsv(lines):
    result = []
    for line in lines:
        # One output tuple per entry; shorter columns are padded with None
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            # Replace None cells with the value from the previous output row
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result
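For example (my own check), feeding it the sample row carries the row number and amount down the group; printed tab-separated:

rows = [['1', 'frank|joe|dave', 'toronto|new york|anaheim', '20']]
for out_row in flatten_tsv(rows):
    print('\t'.join(out_row))
# 1   frank   toronto    20
# 1   joe     new york   20
# 1   dave    anaheim    20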
