Edit: I finally figured it out myself. I kept calling select() on the column inside the function, which is why it didn't work. I added my solution as comments within the original question in case it is of use to somebody else.
I'm working on an online course where I'm supposed to write the following function:
# TODO: Replace <FILL IN> with appropriate code
# Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task
from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.
    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.
    Args:
        column (Column): A Column containing a sentence.
    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    # EDIT: MY SOLUTION
    # column = lower(column)
    # column = regexp_replace(column, r'([^a-z\d\s])+', r'')
    # return trim(column).alias('sentence')
    return <FILL IN>
sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
(' No under_score!',),
(' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
.select(removePunctuation(col('sentence')))
.show(truncate=False))
I've written the code that gives me the required output for operations on the DataFrame itself:
# Lower case (the result is named lowered so the imported lower() function is not shadowed)
lowered = sentenceDF.select(lower(col('sentence')).alias('lower'))
lowered.show()
# Remove Punctuation
cleaned = lowered.select(regexp_replace(col('lower'), r'([^a-z\d\s])+', r'').alias('cleaned'))
cleaned.show()
# Trim
sentenceDF = cleaned.select(trim(col('cleaned')).alias('sentence'))
sentenceDF.show(truncate=False)
I just don't know how to implement this code in my function, since it doesn't operate on the DataFrame but only on the given column. I've tried different approaches. One was to create a new DataFrame out of the column input using
[...]
df = sqlContext.createDataFrame(column, ['sentence'])
[...]
within the function, but it doesn't work: TypeError: Column is not iterable. Other approaches were trying to directly operate on column within the function, always leading to TypeError: 'Column' object is not callable.
I started with (Py)Spark a few days ago and still have conceptual problems with operating on rows and columns only. I would really appreciate any kind of help with this issue.
You can do this in a single line if text is a plain Python string (it needs import re, and it will not work on a Spark Column).
return re.sub(r'[^a-z0-9\s]','',text.lower().strip()).strip()
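For comparison, here is a minimal sketch of the same clean-up on plain Python strings, using only the standard library. The function name remove_punctuation and the samples list are assumptions made for illustration. It does not operate on a Spark Column, so it is not a solution to the exercise in the question, which forbids UDFs and RDD operations.

```python
import re

def remove_punctuation(text):
    """Lower-case a plain string, drop everything except a-z, 0-9 and whitespace, then trim."""
    lowered = text.lower()
    # Remove punctuation first, so that spaces exposed by the removal are trimmed afterwards
    cleaned = re.sub(r'[^a-z0-9\s]', '', lowered)
    return cleaned.strip()

# Sample sentences taken from the question's test DataFrame
samples = ['Hi, you!', ' No under_score!', ' * Remove punctuation then spaces * ']
cleaned_samples = [remove_punctuation(s) for s in samples]
print(cleaned_samples)
```

The order matters here: trimming only before the substitution would leave the spaces that sat next to the removed asterisks.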
Related
How do I replace the cell values in a column when they contain a number, or a specific character such as a comma, so that the whole cell value is replaced with something else?
For example, if a cell contains a comma, meaning it holds more than one item, I want it to be replaced by text like "ENM".
If a cell holds a number value, I want to replace it with 'UNM'.
As you have not provided examples of what your expected and current output look like, I'm making some assumptions below. What it seems like you're trying to do is iterate through every value in a column and if the value meets certain conditions, change it to something else.
Just a general pointer. Iterating through dataframes requires some important considerations for larger sizes. Read through this answer for more insight.
Start by defining a function you want to use to check the value:
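def has_comma(value):
    # Only strings can contain a comma; numbers return False instead of raising TypeError
    return isinstance(value, str) and ',' in value
Then use the pandas.Series.replace method to make the change.
for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    elif isinstance(i, (int, float)):
        # Only numeric cells become 'UNM'; other strings are left unchanged
        df['column_name'] = df['column_name'].replace([i], 'UNM')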
Say you have a column, i.e. a pandas Series, called col.
The following code maps values containing a comma to "ENM", as in your example (na=False makes non-string cells count as not matching):
col.mask(col.str.contains(',', na=False), "ENM")
You can overwrite your original column with this result if that's what you want to do. This approach will be much faster than looping through each element.
For mapping floats to "UNM" as per your example the following would work
col.mask(col.apply(isinstance, args=(float,)), "UNM")
Hopefully you get the idea.
See https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html for more info on masking
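Both rules can also be written as one per-value function. The sketch below is plain Python, and the name classify_cell and the sample cells list are assumptions for illustration. It assumes the cells hold plain Python strings, ints and floats. Such a function could be passed to Series.map, although the mask approach is likely to be faster on large columns.

```python
def classify_cell(value):
    """Return 'ENM' for strings containing a comma, 'UNM' for numbers, else the value unchanged."""
    if isinstance(value, str) and ',' in value:
        return 'ENM'
    # bool is a subclass of int, so it is excluded explicitly.
    # Note that NaN is a float and would therefore also become 'UNM'.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return 'UNM'
    return value

# Hypothetical column contents, standing in for the values of a Series
cells = ['apple, pear', 'plum', 3.5, 7, 'a,b,c']
relabelled = [classify_cell(c) for c in cells]
print(relabelled)
```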
I'm new to Python. If I have this nested list (a list of lists):
testGrid = [['p','c','n','d','t','h','g'],
['w','a','x','o','a','x','f'],
['o','t','w','g','d','r','k'],
['l','j','p','i','b','e','t'],
['f','v','l','t','o','w','n']]
How can I print it out so that it reads without any commas and without spaces? And new lines after each row?
pcndthg
waxoaxf
otwgdrk
ljpibet
fvltown
Use join() to concatenate all the strings in a list.
for row in testGrid:
    print(''.join(row))
Or unpack the row and change print's default separator to an empty string.
for row in testGrid:
    print(*row, sep='')
Barmar's answer is likely the most efficient possible way to do this in Python, but for the sake of learning programming logic, here is an answer that guides you through the process step by step:
First of all, in a nested list, usually 2 layers of loops are required (if no helper or built-in functions are used). Hence our first layer of for loop will have a 1D list as an element.
for row in testGrid:
    print("something")
    # row = ['p','c','n','d','t','h','g']
So within this loop, we attempt to loop through each character in row:
    for char in row:
        print(char)
        # char = 'p'
Since the built-in print() function in Python will move to the next line by default, we try to use a string variable to "stack" all characters before outputting it:
for row in testGrid:
    # loop content applies to each row
    # define string variable
    vocab = ""
    for char in row:
        # string concatenation (piecing 2 strings together)
        vocab = vocab + char
    # vocab now contains the entire row, pieced into one string
    print(vocab)
    # remark: usually in other programming languages, moving cursor to the next line requires extra coding
    # in Python it is not required but it is still recommended to keep this in mind
Hopefully this helps you understand programming concepts and flows better!
New Python user here, so please pardon my ignorance if my approach seems completely off.
I am having trouble filtering rows of a column based on their character/number format.
Here's an example of the DataFrame and Series
df = {'a': [1, 2, 4, 5, 6], 'b': [7, 8, 9, 10], 'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']}
The column I am looking to filter is the "target" column based on their character/number combos and I am essentially trying to make a dict like below
{'ABC1234': counts_of_format
'ABC123': counts_of_format
'123ABC': counts_of_format
'any_other_format': counts_of_format}
Here's my progress so far:
col = df['target'].astype('string')
abc1234_pat = r'^[A-Z]{3}[0-9]{4}'
matches = re.findall(abc1234_pat, col)
I keep getting this error:
TypeError: expected string or bytes-like object
I've double-checked the dtype and it comes back as string. I've researched the TypeError, and the only solution I can find is converting it to a string.
Any insight or suggestion on what I might be doing wrong, or if this is simply the wrong approach to this problem, will be greatly appreciated!
Thanks in advance!
I am trying to create a dict that returns how many times the different character/number combos occur. For example, how many times does 3 characters followed by 4 numbers occur and so on.
(Your problem would have been understood earlier and more easily had you stated this in the question post itself rather than in a comment.)
By characters, you mean letters; by numbers, you mean digits.
abc1234_pat = r'^[A-Z]{3}[0-9]{4}'
The TypeError itself arises because re.findall expects a single string, not a whole pandas Series. Beyond that, since you want to count occurrences of all character/number combos, using one concrete pattern would not lead very far. I suggest transforming the targets to a canonical form that serves as the key of your desired dict, e.g. substituting every letter with C and every digit with N (using your terms).
Of the many ways to tackle this, one is to use str.translate together with a class that does this transformation.
class classify():
    def __getitem__(self, key):
        # key is the code point of one character; ord(None) raises TypeError for anything else
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)
occ = df.target.str.translate(classify()).value_counts()  # .to_dict()
Note that this will purposely raise an exception if target contains non-alphanumeric characters.
You can convert the resulting Series to a dict with .to_dict() if you like.
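The same canonical-form idea can be sketched without pandas, using only the standard library. The names canonical_form and format_counts are assumptions for illustration, and raising ValueError for non-alphanumeric characters is a choice made here, not part of the answer above. The targets list is copied from the question.

```python
from collections import Counter

def canonical_form(text):
    """Map every letter to 'C' and every digit to 'N'; reject any other character."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append('C')
        elif ch.isdigit():
            out.append('N')
        else:
            raise ValueError(f"non-alphanumeric character {ch!r} in {text!r}")
    return ''.join(out)

# Values of the 'target' column from the question
targets = ['ABC1234', 'ABC123', '123ABC', '7KZA23']

# Count how often each letter/digit layout occurs
format_counts = dict(Counter(canonical_form(t) for t in targets))
print(format_counts)
```

Each key of format_counts describes one layout, for example 'CCCNNNN' for three letters followed by four digits.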
I have a dataframe that consists of lines that look like:
"{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}"
How do I split this into columns? I've tried str.slice[stop and start].
I suspect it's all the quotes, but finding and replacing them doesn't seem to work either.
You can handle the first problem, that each value is a string, with the eval('..') function. It evaluates the string as a Python expression, so it returns the dict itself.
For the second problem, the dict structure, you have several options. Here is one solution:
import pandas as pd
# Transform the string in dict
dict_data=eval("{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}")
# Organize the data
columns_name = dict_data.keys()
data_list = [list(dict_data.values())] # a row must be a list inside a list
pd.DataFrame(data_list, columns=columns_name)
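If the strings come from an untrusted source, ast.literal_eval from the standard library is a safer way to do the parsing step, because it accepts only Python literals and does not execute arbitrary code. The sketch below covers just that step. The variable names raw, record, columns and row are assumptions for illustration.

```python
import ast

# One line of the DataFrame, copied from the question
raw = "{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}"

# Parse the string into a dict without executing it as code
record = ast.literal_eval(raw)

# Keys become the column names, values become one row
columns = list(record.keys())
row = list(record.values())
print(columns)
print(row)
```

A list of such parsed dicts can then be passed to pd.DataFrame directly, which gives one column per key.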
I am trying to make a new column depending on different criteria. I want to add characters to the string dependent on the starting characters of the column.
An example of the data:
RH~111~header~120~~~~~~~ball
RL~111~detailed~12~~~~~hat
RA~111~account~13~~~~~~~~~car
I want to change those starting with RH and RL, but not the ones starting with RA. So I want to look like:
RH~111~header~120~~1~~~~~ball
RL~111~detailed~12~~cancel~~~ball
RA~111~account~12~~~~~~~~~ball
I have attempted to use str split, but it doesn't seem to actually be splitting the string up
(np.where(~df['1'].str.startswith('RH'),
df['1'].str.split('~').str[5],
df['1']))
This is referencing the correct columns but not splitting it where I thought it would, and I can't seem to get further than this. I feel like I am not really going about this the right way.
Define a function that replaces the element at index pos in the list arr:
def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)
Then perform the substitution:
df[0] = df[0].mask(df[0].str.match('^R[HL]'),
df[0].str.split('~').apply(repl, pos=5))
Details:
str.match ensures that only the rows starting with RH or RL are substituted.
df[0].str.split('~') splits the column of strings into a column of lists (one list per string).
apply(repl, pos=5) computes the value to substitute.
I assumed that you have a DataFrame with a single column, so its column name is 0 (an integer) instead of '1' (a string). If this is not the case, change the column name in the code above.
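To check the logic on the sample rows without a DataFrame, here is a plain-Python sketch of the same substitution. The function name set_field and its sep parameter are assumptions for illustration. Unlike repl above, it tests the record type itself and returns other records unchanged, so it could also be applied to a whole column with Series.map.

```python
def set_field(line, pos=5, sep='~'):
    """Fill field `pos` with '1' for RH records and 'cancel' for RL records."""
    fields = line.split(sep)
    if fields[0] == 'RH':
        fields[pos] = '1'
    elif fields[0] == 'RL':
        fields[pos] = 'cancel'
    else:
        # Any other record type (e.g. RA) is returned untouched
        return line
    # Raises IndexError above if an RH/RL line has fewer than pos + 1 fields
    return sep.join(fields)

# Sample rows copied from the question
lines = [
    'RH~111~header~120~~~~~~~ball',
    'RL~111~detailed~12~~~~~hat',
    'RA~111~account~13~~~~~~~~~car',
]
updated = [set_field(line) for line in lines]
for line in updated:
    print(line)
```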