List of unique characters of a dataset - python

I have a dataset in a dataframe and I want to see the total number of characters and the list of unique characters.
As for the total number of characters, I have implemented the following code, which seems to work well:
df["Preprocessed_Text"].str.len().sum()
Could you please let me know how to get a list with the unique characters (not including the space)?

Try this:
from string import ascii_letters
chars = set(''.join(df["Preprocessed_Text"])).intersection(ascii_letters)
If you need to work with a different alphabet, then simply replace ascii_letters with whatever you need.
If you want every character but the space then:
chars = set(''.join(df["Preprocessed_Text"]).replace(' ', ''))

unichars = list(''.join(df["Preprocessed_Text"]))
print(sorted(set(unichars), key=unichars.index))
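A possible alternative, sketched on a plain list of strings standing in for df["Preprocessed_Text"] (the sample texts are made up for illustration): dict.fromkeys keeps characters in first-appearance order, like the key=unichars.index version, but it does not rescan the list for every character.

```python
# Hypothetical stand-in for the values of df["Preprocessed_Text"]
texts = ["hello world", "data set"]

joined = "".join(texts)

# dict.fromkeys keeps the first occurrence of each character, in order
unique_chars = [c for c in dict.fromkeys(joined) if c != " "]

print(unique_chars)
print(len(joined))  # total number of characters, spaces included
```

With a real dataframe, the list of strings could presumably be replaced by the column itself, as in the answers above.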

unique = list(set([letter for letter in ''.join(df['Preprocessed_Text'].values) if letter != " "]))

Related

Print a string without any other characters except letters, and replace the space with an underscore

I need to print a string, using these rules:
The first letter should be capitalized and all other letters lowercase. Only the characters a-z and A-Z are allowed in the name; any other characters have to be deleted (spaces and tabs are not allowed and are replaced with underscores), and the string cannot be longer than 80 characters.
It seems to me that it is possible to do it somehow like this:
name = "hello2 sjsjs- skskskSkD"
string = name[0].upper() + name[1:].lower()
lenght = len(string) - 1
answer = ""
for letter in string:
    x = letter.isalpha()
    if x == False:
        answer = string.replace(letter,"")
........
return answer
I think it's better to use a for loop or isalpha() here, but I can't think of a better way to do it. Can someone tell me how to do this?
For one-to-one and one-to-None mappings of characters, you can use the .translate() method of strings. The string module provides lists (strings) of the various types of characters including one for all letters in upper and lowercase (string.ascii_letters) but you could also use your own constant string such as 'abcdef....xyzABC...XYZ'.
import string
def cleanLetters(S):
    nonLetters = S.translate(str.maketrans('','',' '+string.ascii_letters))
    return S.translate(str.maketrans(' ','_',nonLetters))
Output:
cleanLetters("hello2 sjsjs- skskskSkD")
'hello_sjsjs_skskskSkD'
One way to accomplish this is to use regular expressions (regex) via the built-in re library, which lets you capture only the valid characters and ignore the rest.
Basic string tools then handle the replacement and capitalisation, with a slice at the end to enforce the length limit.
For example:
import re
name = 'hello2 sjsjs- skskskSkD'
trans = str.maketrans({' ': '_', '\t': '_'})
''.join(re.findall(r'[a-zA-Z \t]', name)).translate(trans).capitalize()[:80]
>>> 'Hello_sjsjs_skskskskd'
Strings are immutable, so every time you do string.replace() it needs to iterate over the entire string to find characters to replace, and a new string is created. Instead of doing this, you could simply iterate over the current string and create a new list of characters that are valid. When you're done iterating over the string, use str.join() to join them all.
answer_l = []
for letter in string:
    if letter == " " or letter == "\t":
        answer_l.append("_") # Replace spaces or tabs with _
    elif letter.isalpha():
        answer_l.append(letter) # Use alphabet characters as-is
    # else do nothing
answer = "".join(answer_l)
With string = 'hello2 sjsjs- skskskSkD', we have answer = 'hello_sjsjs_skskskSkD'.
Now you could also write this using a generator expression instead of creating the entire list and then joining it. First, we define a function that returns the letter or "_" for our first two conditions, and an empty string for the else condition:
def translate(letter):
    if letter == " " or letter == "\t":
        return "_"
    elif letter.isalpha():
        return letter
    else:
        return ""
Then,
answer = "".join(
    translate(letter) for letter in string
)
To enforce the 80-character limit, just take answer[:80]. Because of the way slices work in Python, this won't throw an error even when the length of answer is less than 80.
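Putting the rules from the question together, a minimal sketch might look like the following. The function name clean_name and the max_len parameter are made up for illustration. It checks membership in string.ascii_letters rather than calling isalpha(), on the assumption that accented and other non-ASCII letters should be dropped too, since isalpha() accepts them.

```python
import string

# Only a-z and A-Z count as letters here (assumption based on the question)
ALLOWED = set(string.ascii_letters)

def clean_name(name, max_len=80):
    chars = []
    for ch in name:
        if ch in (" ", "\t"):
            chars.append("_")   # spaces and tabs become underscores
        elif ch in ALLOWED:
            chars.append(ch)    # keep ASCII letters as-is
        # anything else is dropped
    # capitalize() uppercases the first character and lowercases the rest
    return "".join(chars).capitalize()[:max_len]

print(clean_name("hello2 sjsjs- skskskSkD"))
```

For the sample input this prints Hello_sjsjs_skskskskd, the same result as the regex answer above.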

How To Extract Three Letters Followed By Five Digits Using Regex in Python

I have the following dataframe in Python:
abc12345
abc1234
abc1324.
How do I extract only the ones that have three letters followed by five digits?
The desired result would be:
abc12345.
df.column.str.extract('[^0-9](\d\d\d\d\d)$')
I think this works, but is there a better way to write (\d\d\d\d\d)?
What if I had, say, 30 digits? Then I would have to type \d 30 times, which is inefficient.
You should be able to use:
'[a-zA-Z]{3}\d{5}'
If the strings don't include capital letters this can reduce to:
'[a-z]{3}\d{5}'
Change the values in the {x} to adjust the number of chars to capture.
Or with the following code:
import re
s = "abc12345"
p = re.compile(r"\d{5}")
c = p.match(s, 3)  # start matching at index 3, after the three letters
print(c.group())
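Note that the patterns above are not anchored, so they would also match inside a longer string. A small sketch, using a plain list in place of the dataframe column, of how re.fullmatch can be used to keep only the values that consist of exactly three letters followed by five digits:

```python
import re

pattern = re.compile(r"[a-zA-Z]{3}\d{5}")

# Sample values from the question, as a plain list instead of a dataframe column
values = ["abc12345", "abc1234", "abc1324."]

# fullmatch succeeds only if the whole string fits the pattern
matches = [v for v in values if pattern.fullmatch(v)]

print(matches)  # ['abc12345']
```

On a pandas column, the same pattern could presumably be passed to Series.str.fullmatch to build a boolean mask.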

Regex: How to match any letters up until a number And Matching a Dash

I am trying to match sequences of strings that follow certain rules:
rlg3-22, rlas1-4
pz
xx-0
r1-6
For example, in the first row, I want to match the string up until the "-" character, so that I can perform the following function of expanding the string into (rlg3, rlg4, ..., rlg22).
In the second row, I would leave it as is.
In the third row, I would also leave it as is because there was no number first.
Thank you!
import re
d = 'rlg3-22'
ops = re.findall(r"\d+", d)  # r"\d+" matches runs of digits of variable length
prefix = re.findall(r"\D+", d)[0]  # r"\D+" is the complement of r"\d+"
# build the list by adding the prefix to the string form of each integer;
# the + 1 makes the upper bound inclusive
[prefix + str(i) for i in range(int(ops[0]), int(ops[1]) + 1)]
['rlg3', 'rlg4', 'rlg5', 'rlg6', 'rlg7', 'rlg8', 'rlg9', 'rlg10', 'rlg11', 'rlg12',
'rlg13', 'rlg14', 'rlg15', 'rlg16', 'rlg17', 'rlg18', 'rlg19', 'rlg20', 'rlg21', 'rlg22']
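The snippet above assumes the input always has the letters-number-dash-number shape. A sketch that also handles the other rows from the question by leaving anything that does not fit that shape unchanged could look like this. The names RANGE_RE and expand are made up for illustration, and the exact pattern is an assumption about what counts as a valid range.

```python
import re

# letters, a start number, a dash, an end number (assumed shape of a range)
RANGE_RE = re.compile(r"([A-Za-z]+)(\d+)-(\d+)")

def expand(token):
    m = RANGE_RE.fullmatch(token)
    if m is None:
        return [token]  # not a range such as "pz" or "xx-0": leave as is
    prefix, start, stop = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(start, stop + 1)]

for row in ["rlg3-22, rlas1-4", "pz", "xx-0", "r1-6"]:
    tokens = [t.strip() for t in row.split(",")]
    print([name for t in tokens for name in expand(t)])
```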

How do I start reading from a certain character in a string?

I have a list of strings that look something like this:
"['id', 'thing: 1\nother: 2\n']"
"['notid', 'thing: 1\nother: 2\n']"
I would now like to read the value of 'other' out of each of them.
I did this by reading the character at a fixed position, but since the position varies, I wondered if I could read relative to a certain character, such as a comma, and say: read the character x positions after the comma. How would I do that?
Assuming that "other: " is always present in your strings, you can use it as a separator and split by it:
s = 'thing: 1\nother: 2'
_,number = s.split('other: ')
number
#'2'
(Use int(number) to convert the number-like string to an actual number.) If you are not sure whether "other: " is present, enclose the above code in a try-except statement.
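A minimal sketch of the try-except variant, applied to the full strings from the question. The function name read_other is hypothetical, and returning None for a missing or non-numeric value is an assumption about the desired behaviour.

```python
def read_other(s, key="other: "):
    try:
        # split once on the key; unpacking fails with ValueError if it is missing
        _, rest = s.split(key, 1)
        # the value is the first whitespace-delimited chunk after the key
        return int(rest.split()[0])
    except (ValueError, IndexError):
        return None

rows = [
    "['id', 'thing: 1\nother: 2\n']",
    "['notid', 'thing: 1\nother: 2\n']",
]
print([read_other(r) for r in rows])
```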

Adding prefix to string in a file

Well, I have a sort of telephone directory in a .txt file.
What I want to do is find all the numbers with this pattern, e.g. 829-2234, and prepend the number 5 to them,
so the result becomes 5829-2234.
My code begins like this:
import os
import re
count = 0
# set up our regex
regex = re.compile(r"\d{3}-\d{4}\s")
# open file for scanning
f = open("samplex.txt")
# begin finding numbers matching the pattern
for line in f:
    pattern = regex.findall(line)
    # isolate results
    for word in pattern:
        print(word)
        count = count + 1  # count occurrences of 7-digit numbers
        # replace 7-digit numbers with 8-digit numbers
        word = '%dword' % 5
Well, I don't really know how to add the prefix 5 and then overwrite the 7-digit number with the prefixed one. I tried a few things, but they all failed :/
Any tip/help would be greatly appreciated :)
Thanks
You're almost there, but you got your string formatting the wrong way round. Since you know that 5 will always be in the string (because you're adding it), you can do:
word = '5%s' % word
Note that you can also use string concatenation here:
word = '5' + word
Or even use str.format():
word = '5{}'.format(word)
If you're doing it with regex then use re.sub:
>>> strs = "829-2234 829-1000 111-2234 "
>>> regex = re.compile(r"\b(\d{3}-\d{4})\b")
>>> regex.sub(r'5\1', strs)
'5829-2234 5829-1000 5111-2234 '
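To apply this to the file itself, one option is to read the whole text, substitute, and write it back. The following is only a sketch: the names add_prefix and rewrite_file are made up, and it assumes the file is small enough to hold in memory and that overwriting it in place is acceptable. It uses subn, which also returns the number of replacements, so the count from the question comes for free.

```python
import re
from pathlib import Path

PHONE_RE = re.compile(r"\b(\d{3}-\d{4})\b")

def add_prefix(text, prefix="5"):
    # subn returns (new_text, number_of_substitutions)
    return PHONE_RE.subn(lambda m: prefix + m.group(1), text)

def rewrite_file(path, prefix="5"):
    # path is whatever file holds the directory, e.g. "samplex.txt"
    p = Path(path)
    new_text, count = add_prefix(p.read_text(), prefix)
    p.write_text(new_text)
    return count

print(add_prefix("829-2234 829-1000 111-2234 "))
```

Because of the \b word boundaries, a number that already carries the prefix (such as 5829-2234) is not matched again, so running the rewrite twice should not add a second 5.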
