Adding prefix to string in a file - python

I have a sort of telephone directory in a .txt file.
What I want to do is find all the numbers matching this pattern, e.g. 829-2234, and add the digit 5 to the beginning of each number,
so that the result becomes 5829-2234.
My code begins like this:
import os
import re
count = 0
# set up our regex
regex = re.compile(r"\d{3}-\d{4}\s")
# open file for scanning
f = open("samplex.txt")
# begin finding numbers matching the pattern
for line in f:
    pattern = regex.findall(line)
    # isolate results
    for word in pattern:
        print(word)
        count = count + 1  # count the occurrences of 7-digit numbers
        # replace 7-digit numbers with 8-digit numbers
        word = '%dword' % 5
Well, I don't really know how to add the prefix 5 and then overwrite each 7-digit number with its prefixed version. I tried a few things, but they all failed :/
Any tips or help would be greatly appreciated :)
Thanks

You're almost there, but you have the string formatting the wrong way round. Since you know the 5 will always be in the string (because you're adding it), put it in the format string and substitute the word:
word = '5%s' % word
Note that you can also use string concatenation here:
word = '5' + word
Or even use str.format():
word = '5{}'.format(word)

If you're doing it with regex then use re.sub:
>>> strs = "829-2234 829-1000 111-2234 "
>>> regex = re.compile(r"\b(\d{3}-\d{4})\b")
>>> regex.sub(r'5\1', strs)
'5829-2234 5829-1000 5111-2234 '
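To apply this to the whole file and save the result, one option is to read the text, run the substitution, and write it back. This is only a minimal sketch, assuming the file is small enough to read into memory. The helper name prefix_numbers is made up for illustration, and re.subn is used instead of re.sub so the number of replacements comes back as well:

```python
import re

# Matches a standalone 7-digit number written as 123-4567.
NUMBER = re.compile(r"\b(\d{3}-\d{4})\b")

def prefix_numbers(path, prefix="5"):
    """Prefix every 7-digit number in the file at `path`; return how many were changed."""
    with open(path) as f:
        text = f.read()
    # subn returns a tuple: (new_text, number_of_substitutions).
    new_text, count = NUMBER.subn(lambda m: prefix + m.group(1), text)
    with open(path, "w") as f:
        f.write(new_text)
    return count

# Hypothetical usage with the file name from the question:
# changed = prefix_numbers("samplex.txt")
```

Because of the word boundaries, a number that already carries the prefix (5829-2234) is not matched again, so running the helper twice should not add a second 5.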

List of unique characters of a dataset

I have a dataset in a dataframe, and I want to see the total number of characters and the list of unique characters.
For the total number of characters, I have implemented the following code, which seems to work well:
df["Preprocessed_Text"].str.len().sum()
Could you please let me know how to get a list of the unique characters (not including the space)?
Try this:
from string import ascii_letters
chars = set(''.join(df["Preprocessed_Text"])).intersection(ascii_letters)
If you need to work with a different alphabet, then simply replace ascii_letters with whatever you need.
If you want every character but the space then:
chars = set(''.join(df["Preprocessed_Text"]).replace(' ', ''))
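To keep the characters in order of first appearance (this version still includes the space):
unichars = list(''.join(df["Preprocessed_Text"]))
print(sorted(set(unichars), key=unichars.index))
Or, with a comprehension that skips the space:
unique = list(set(letter for letter in ''.join(df["Preprocessed_Text"].values) if letter != " "))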
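If the order of first appearance matters and the column is large, a single pass over the text avoids rescanning the list for every character. Here is a rough sketch, assuming every entry is a string (no missing values). The function name unique_chars is made up, and a plain list stands in for df["Preprocessed_Text"], though iterating over the column should work the same way:

```python
def unique_chars(texts, skip=" "):
    """Return the distinct characters in `texts`, in order of first appearance."""
    seen = {}
    for text in texts:
        for ch in text:
            if ch not in skip:
                seen.setdefault(ch, None)  # dicts keep insertion order (Python 3.7+)
    return list(seen)

sample = ["hello world", "hold on"]  # stand-in for df["Preprocessed_Text"]
print(unique_chars(sample))  # ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'n']
```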

substring with a small change

I'm trying to solve this problem where I'm given a set of strings and have to count how many times a certain word, like 'code', appears within a string. The program should also count any variant where the 'd' changes, like 'coze', but something like 'coz' doesn't count. This is what I made:
def count(word):
    count=0
    for i in range(len(word)):
        lo=word[i:i+4]
        if lo=='co': # this is what gives me trouble
            count+=1
    return count
Test if the first two characters match co and the 4th character matches e.
def count(word):
    count = 0
    for i in range(len(word) - 3):
        # word[i:i+2] is the two-character slice starting at i
        if word[i:i+2] == 'co' and word[i+3] == 'e':
            count += 1
    return count
The loop only goes up to len(word)-3 so that word[i+3] won't go out of range.
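The same idea can be generalized a little so that it works for other words as well. The following is just a sketch: the function name count_with_wildcard and the '?' placeholder are made up for illustration, with '?' standing for "any single character":

```python
def count_with_wildcard(text, template, wildcard="?"):
    """Count positions where `text` matches `template`; `wildcard` matches any one character."""
    n = len(template)
    count = 0
    # Stop early enough that every window has exactly n characters.
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        if all(t == wildcard or t == c for t, c in zip(template, window)):
            count += 1
    return count

print(count_with_wildcard("aaacodebbb coze coz", "co?e"))  # 2
```

The trailing 'coz' is not counted because there is no full four-character window left for it.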
You could use regex for this, through the re module.
import re
string = 'this is a string containing the words code, coze, and coz'
print(re.findall(r'co.e', string))
# ['code', 'coze']
From there you could write a function such as:
def count(string, word):
    return len(re.findall(word, string))
Regex is the answer to your question, as mentioned above, but what you need is a more refined pattern. Since you are looking for occurrences of a whole word, you need to anchor the pattern at word boundaries. So your pattern should be something like this:
pattern = r'\bco.e\b'
This way your search will not match inside words like testcodetest or cozetest; it will only match standalone words such as code, coze, or coke, with no leading or trailing characters.
If you're going to run the search many times, it's better to compile the pattern once and reuse it; that saves looking the pattern up on every call.
In [1]: import re
In [2]: string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
In [3]: pattern = re.compile(r'\bco.e\b')
In [4]: pattern.findall(string)
Out[4]: ['code', 'coze']
Hope that helps.
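To see what the word boundaries change, here is a small comparison on the same sample string. It is only a sketch, and the helper name count_matches is made up for illustration:

```python
import re

text = 'this is a string containing the codeorg testcozetest words code, coze, and coz'

def count_matches(pattern, text):
    """Count non-overlapping matches of `pattern` in `text`."""
    return sum(1 for _ in re.finditer(pattern, text))

print(count_matches(r'co.e', text))      # 4: also hits "codeorg" and "testcozetest"
print(count_matches(r'\bco.e\b', text))  # 2: only the standalone words
```

Keep in mind that . matches any character except a newline, so something like co-e would count too. If only letters should be allowed in that position, a character class such as [a-z] is probably a better fit.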

Python - Finding all numeric values in a string, then storing each numeric in a list uniquely

I would like to be able to grab any and all numeric values from a string, if found, and then store them in a list individually.
I'm currently able to identify all the numeric values, but I can't figure out how to store them individually.
import re
phones = list()
comment = "Sues phone numbers are P#3774794773 and P#6047947730."
words = comment.split()
for word in words:
    word = word.rstrip()
    nums = re.findall(r'\d{10,10}',word)
    if nums not in phones:
        phones.append(nums)
print(phones)
I would like those two values to be stored as 3774794773, 6047947730 instead of as lists within a list.
The end goal is to output (print) each value separately.
Current Print: [ [], ['3774794773'], ['6047947730'] ]
Needed Print: 3774794773, 6047947730
Thanks in advance.
You're doing the work twice: there is no need to split the comment into words first. Just run a regex that matches 10-digit numbers over the whole string, like so:
import re
comment = "Sues phone numbers are P#3774794773 and P#6047947730."
nums = re.findall(r'\d{10,10}', comment)
print(nums)
If you want the numbers also to be exact (not to match longer sequences) you can do the following:
comment = "Sues phone numbers are P#3774794773 123145125125215 and P#6047947730."
nums = re.findall(r'\b\d{10,10}\b', comment)
print(nums)
(\b is a zero-width regex assertion: it doesn't match any characters itself, but rather a position in the string, namely the boundary between a word character and a non-word character.)
Both result in:
['3774794773', '6047947730']
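To also skip repeated numbers and get the comma-separated output the question asks for, something along these lines should work. This is a sketch: the comment string is extended with a made-up repeat to show the de-duplication, and \d{10} is just the shorter spelling of \d{10,10}:

```python
import re

comment = "Sues phone numbers are P#3774794773 and P#6047947730, again P#3774794773."
# dict.fromkeys drops repeats while keeping first-seen order.
phones = list(dict.fromkeys(re.findall(r'\b\d{10}\b', comment)))
print(", ".join(phones))  # 3774794773, 6047947730
```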
Alternatively, save the two numbers in a file, separated by a single space, and then use this code to read them into separate variables:
with open("CS.txt", "r") as f:
    number1, number2 = f.read().split(" ")
print(number1)
print(number2)

How do I make it where the code ignores every symbol within a string except for those in a list?

I was trying to make a program that could be used for one-time pad encryption by counting the number of characters and generating a random number for each one. I started writing a line that would let the program ignore spaces, but then I realized I would also need to ignore other symbols. I had looked at "How to count the number of letters in a string without the spaces?" for the spaces, and it proved very helpful. However, the answers only show how to remove one symbol at a time. To do what I want with that approach, I would need a long chain of - how_long.count('character') terms, and symbols that I don't even know of might still be counted. So I am asking for a way to count only the alphabetic characters that I write down in a list. Is this possible, and if so, how would it be done?
My code:
import random
import sys
num = 0
how_long = input("Message (The punctuation will not be counted)\n Message: ")
charNum = len(how_long) - how_long.count(' ')
print("\n")
print("Shift the letters individually by their respective numbers.")
for num in range(0, charNum-1):
    sys.stdout.write(str(random.randint(1, 25))+", ")
print(random.randint(1, 25))
If your desired outcome is to clean a string so that it only contains a desired subset of characters, the following will work. I'm not sure I fully understand your question, though, so you will probably have to modify it somewhat.
desired_letters = 'ABCDOSTRY'
test_input = 'an apple a day keeps the doctor away'
cleaned = ''.join(l for l in test_input if l.upper() in desired_letters)
# cleaned == 'aaadaystdoctoraay'
Use Regex to find the number of letters in the input:
import re, sys, random
how_long = input("Message (The punctuation will not be counted)\n Message: ")
regex_for_letters = "[A-Za-z]"
letter_count = 0
for char in how_long:
    check_letter = re.match(regex_for_letters, char)
    if check_letter:
        letter_count += 1
print(letter_count)
for num in range(0, letter_count-1):
    sys.stdout.write(str(random.randint(1, 25))+", ")
print(random.randint(1, 25))
Filter the string:
source_string='My String'
allow_chars=['a','e','i','o','u'] #whatever characters you want to accept
source_string_list=list(source_string)
source_string_filtered=list(filter(lambda x: x in allow_chars,source_string_list))
The count is then len(source_string_filtered).
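Putting the counting and the random shifts together, a compact version could look like the sketch below. The function name pad_for is made up for illustration, and string.ascii_letters is only an assumed default for the list of allowed characters, so pass your own string if you want a different set:

```python
import random
import string

def pad_for(message, allowed=string.ascii_letters):
    """Return one random shift (1-25) for each character of `message` found in `allowed`."""
    letter_count = sum(1 for ch in message if ch in allowed)
    return [random.randint(1, 25) for _ in range(letter_count)]

shifts = pad_for("Hello, World!")
print(len(shifts))                        # 10
print(", ".join(str(s) for s in shifts))  # ten random numbers, different on each run
```

Note that random is fine for experimenting but is not meant for real cryptography; the standard library's secrets module exists for that purpose.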

Best way to convert string to integer in Python

I have a spreadsheet with text values like A067, A002, A104, and I need to convert them to integers. What is the most efficient way to do this? Right now I am doing the following:
s = 'A067'
s = s.replace('A','')
n = int(s)
print(n)
Depending on your data, the following might be suitable:
import string
print(int('A067'.strip(string.ascii_letters)))
Python's str.strip() method takes a string of characters to remove from the start and end of a string. Passing string.ascii_letters removes any leading and trailing letters.
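One thing to watch for is that strip() only trims the two ends of the string. A quick sketch with a few made-up inputs shows the difference:

```python
import string

# strip() only trims from the two ends; letters in the middle stay put.
print('A067'.strip(string.ascii_letters))     # 067
print('AB067cd'.strip(string.ascii_letters))  # 067
print('A0B67C'.strip(string.ascii_letters))   # 0B67
```

In the last case int() would raise a ValueError, so this approach only fits data where the letters sit at the start or end of each value.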
If the only non-number part of the input will be the first letter, the fastest way will probably be to slice the string:
s = 'A067'
n = int(s[1:])
print(n)
If you expect to find more than one number per string, though, the regex answers will most likely be easier to work with.
You could use regular expressions to find numbers.
import re
s = 'A067'
s = re.findall(r'\d+', s)  # find all runs of digits in the string
n = int(s[0])  # take the first number; raises IndexError if there are none, so check the list first
print(n)
Here's some example output of findall with different strings
>>> a = re.findall(r'\d+', 'A067')
>>> a
['067']
>>> a = re.findall(r'\d+', 'A067 B67')
>>> a
['067', '67']
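The check for "no numbers found" can be folded into a small helper. This is a sketch rather than the one right way to do it; the name first_int and the default parameter are made up for illustration, and re.search is used here since only the first number is needed:

```python
import re

def first_int(text, default=None):
    """Return the first run of digits in `text` as an int, or `default` if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else default

print(first_int('A067'))       # 67
print(first_int('A067 B67'))   # 67
print(first_int('no digits'))  # None
```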
You can use a named group with re.search from the re module:
import re
line = 'A067'  # example input
regex = re.compile(r"(?P<numbers>\d+)")
matcher = regex.search(line)
if matcher:
    numbers = int(matcher.groupdict()["numbers"])  # the digits captured by the named group
import string
s = 'A067'
print(int(s.strip(string.ascii_letters)))
