Counting words that start with capital letter on python [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I'm a newbie here. I'm working on a program that counts the words that start with a capital letter, per line, inside a CSV file using Python. I used a regex, but I don't think it works. Here is the sample code I wrote; unfortunately it doesn't give the output I want. I hope you can help me.
import re
line_details = []
result = []
count = 0
total_lines = 0
class CapitalW(): #F8 Word that starts with capital letter count
    fh = open(r'20items.csv', "r", encoding = "ISO-8859-1").read()
    #next(fh)
    for line in fh.split("n"):
        total_lines += 1
        for line in re.findall('[A-Z]+[a-z]+$', fh):
            count+=1
        line_details.append("Line %d has %d Words that start with capital letter" %
                            (total_lines, count))
    for line in line_details:
        result7 = line
        print (result7)
The result should be as follows:
Line 1 has 2 Words that start with capital letter
Line 2 has 5 Words that start with capital letter
Line 3 has 1 Words that start with capital letter
Line 4 has 10 Words that start with capital letter

In the regex you don't need the $ character, because it anchors the match to the end of the string, so [A-Z]+[a-z]+$ can only match a word at the very end of the text being searched. Use [A-Z]+[a-z]+ instead.
The other thing is that, judging from the encoding, your data may contain letters outside a-z, for example é. If so, you have to add these to the pattern as well, e.g. [A-ZÉÖ]+[a-zéö]+, along with any other special characters you need.
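If listing every accented letter in the character class becomes unwieldy, one alternative is to skip the regex and rely on Python's Unicode-aware str.isupper(). The following is only a sketch: the function name count_capitalized is made up for illustration, and it assumes words are separated by whitespace, so a word that begins with punctuation (such as an opening quote) is not counted.

```python
def count_capitalized(line):
    """Count whitespace-separated words whose first character is uppercase."""
    # str.isupper() is Unicode-aware, so 'É' and 'Ö' count as capitals too.
    return sum(1 for word in line.split() if word[0].isupper())


print(count_capitalized("Élan vital and Öl from Paris"))
```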

Assuming a fixed indentation and in addition to matebende's answer, these are the required further corrections:
for line in fh.split("n"): is supposed to be for line in fh.split("\n"):.
The initialization count = 0 has to be inside this for loop.
The fh in for line in re.findall('[A-Z]+[a-z]+$', fh): is wrong and has to be line.
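Putting these corrections together, a minimal sketch could look like the following. The function name capitalized_counts and the sample text are assumptions made for illustration; the pattern is the one suggested in the answers, and the real program would read the text from the CSV file instead.

```python
import re

# Pattern from the answers above: one or more capitals followed by lowercase letters.
CAPITALIZED = re.compile(r'[A-Z]+[a-z]+')


def capitalized_counts(text):
    """Return the number of capitalized words found on each line of text."""
    # Split on a real newline and search each line, not the whole file.
    return [len(CAPITALIZED.findall(line)) for line in text.split("\n")]


# Hypothetical sample standing in for the contents of 20items.csv.
sample = "Hello World foo\nno capitals here\nAlice met Bob In Paris"
for number, count in enumerate(capitalized_counts(sample), start=1):
    print("Line %d has %d Words that start with capital letter" % (number, count))
```

Because the count is computed per line, there is no shared counter to reset between lines.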

Related

Python and Regex - Replace Date [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have multiple strings in the form of :
AM-2019-04-22 06-47-57865BCBFB-9414907A-4450BB24
And I need the month from the date part replaced with something else, for example:
AM-2019-07-22 06-47-57865BCBFB-9414907A-4450BB24
How can I achieve this using python and regex?
Also, I have multiple text files that contain a line similar to this:
LocalTime: 21/4/2019 21:48:41
And I need to do the same thing as above (replace the month with something else).
For your first example:
import re
string = 'AM-2019-07-22 06-47-57865BCBFB-9414907A-4450BB24'
# Add 2 to the month and keep it zero-padded to two digits
replace = '%02d' % (int(re.match(r'(AM-\d{4}-)(\d{2})', string).group(2)) + 2)
re.sub(r'(AM-\d{4}-)(\d{2})', r'\g<1>' + replace, string)  # Replace 07 with 07 + 2
Output:
Out[10]: 'AM-2019-09-22 06-47-57865BCBFB-9414907A-4450BB24'
For the second one:
string2 = 'LocalTime: 21/4/2019 21:48:41'
replace = str(int(re.match(r'(LocalTime: \d{1,2}/)(\d{1,2})', string2).group(2)) + 2)
re.sub(r'(LocalTime: \d{1,2}/)(\d{1,2})', r'\g<1>' + replace, string2)  # Replace 4 with 6
Output:
Out[14]: 'LocalTime: 21/6/2019 21:48:41'
If you want to limit the months in which this operation is done, you can wrap the replacement in an if statement that tests the month with the in operator:
if re.match(r'(AM-\d{4}-)(\d{2})', string).group(2) in ['04', '05', '06']:
if re.match(r'(LocalTime: \d{1,2}/)(\d{1,2})', string2).group(2) in ['4', '5', '6']:
Similar answer but with more code and a lookbehind.
First question:
import re
#This could be any number of strings, or some other iterable
string_list = []
string_list.append("AM-2019-04-22 06-47-57865BCBFB-9414907A-4450BB24")
string_list.append("AM-2019-07-22 06-47-57865BCBFB-9414907A-4450BB24")
#This checks for four digits and a hyphen, then any number of digits to
#replace (which is the month)
pattern = r"(?<=\d{4}-)\d+"
#This should be a string
month = "08"
for string in string_list:
    print("BEFORE: " + string)
    string = re.sub(pattern, month, string)
    print("AFTER: " + string)
Second question:
import re
#checks for a colon, a space, 2 digits, a forward slash, and then selects however many digits follow (which is the month)
#the lookbehind must be fixed-width, so a one-digit day will not match
pattern = r"(?<=: \d{2}/)\d+"
#IMO it's better just to write to another file. You can edit the current file you're on, but it's cleaner this way and you won't accidentally screw it up if my regex is wrong.
#This should be a string
month = "9"
with open("mytextfile.txt", 'r') as in_file, open("myoutputfile.txt", 'w') as out_file:
    for line in in_file:
        changed_line = re.sub(pattern, month, line)
        out_file.write(changed_line)
Hope this helps.
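As a variation on both answers, capture groups with a replacement function avoid the fixed-width restriction of lookbehinds, so one-digit days also work. This is only a sketch: the helper names replace_am_month and replace_localtime_month are hypothetical, and the patterns assume exactly the two formats shown in the question.

```python
import re

# Patterns assume the two formats from the question; group 1 is kept, the month is replaced.
AM_PATTERN = re.compile(r'(AM-\d{4}-)\d{2}')
LOCALTIME_PATTERN = re.compile(r'(LocalTime: \d{1,2}/)\d{1,2}')


def replace_am_month(text, month):
    """Replace the month in an 'AM-YYYY-MM-DD ...' string, zero-padded to two digits."""
    return AM_PATTERN.sub(lambda m: m.group(1) + '%02d' % month, text)


def replace_localtime_month(text, month):
    """Replace the month in a 'LocalTime: D/M/YYYY ...' line, without padding."""
    return LOCALTIME_PATTERN.sub(lambda m: m.group(1) + str(month), text)


print(replace_am_month('AM-2019-04-22 06-47-57865BCBFB-9414907A-4450BB24', 7))
print(replace_localtime_month('LocalTime: 2/4/2019 21:48:41', 9))
```

Using a function as the replacement also sidesteps the ambiguity of writing a group reference directly followed by digits in a replacement string.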

Python code that search for text and copy it to the next line [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I need Python code that reads the lines of a text file and copies the text between specific characters onto the next line, for example the text between the two underscores (_ _).
Input
./2425/1/115_Lube_45484.jpg 45484
./2425/1/114_Spencerian_73323.jpg 73323
Output
./2425/1/115_Lube_45484.jpg 45484
Lube
./2425/1/114_Spencerian_73323.jpg 73323
Spencerian
Any suggestions?
Instead of a regex I would use the built-in split():
text = './2425/1/114_Spencerian_73323.jpg 73323'
output = text.split('_')[1]
print(output)
Of course, this only works if every line contains two underscores.
Try this:
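import re
for line in your_text.splitlines():
    # search() scans the whole line; match() would only try at the start
    match = re.search("_(.*)_", line)
    if match:
        print(match.group(0))  # the match including the underscores
        print(match.group(1))  # only the text between the underscores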
Where your_text is a string containing your example as above.
test = './2425/1/114_Spencerian_73323.jpg_abc_ 73323'
result = test.split("_",1)[1].split("_")[0]
print(result)
.split('_', 1) splits the string into two parts at the first '_': index 0 is the substring to the left of it and index 1 is the rest of the string. We then split that right-hand part on '_' again and take index 0, which extracts the text between the first two underscores.
Note: this is helpful only when a single piece of text between underscores is needed, as in test. It won't extract the other pieces if the pattern occurs multiple times in a string.
Solved.
file_path = "text_file.txt"
with open(file_path) as f:
    line = f.readline()
    count = 1
    while line:
        print(line, line.split('_')[1])
        line = f.readline()
        count += 1
Thank you all
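For reference, the same idea can be written as a small function that returns the interleaved output instead of printing it. This is a sketch under assumptions: the name with_extracted_names is hypothetical, and lines without two underscores are passed through unchanged rather than raising an IndexError.

```python
def with_extracted_names(lines):
    """Return each line followed by the text between its first two underscores."""
    result = []
    for line in lines:
        line = line.rstrip("\n")
        result.append(line)
        parts = line.split("_")
        # At least three parts means there are at least two underscores.
        if len(parts) >= 3:
            result.append(parts[1])
    return result


sample = [
    "./2425/1/115_Lube_45484.jpg 45484\n",
    "./2425/1/114_Spencerian_73323.jpg 73323\n",
]
print("\n".join(with_extracted_names(sample)))
```

Since an open file is an iterable of lines, the function should also accept a file object in place of the sample list.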

Getting a word with mid frequency [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 8 years ago.
I have a word list containing numbers, English words, and Bengali words in one column and their frequencies in another column. The columns have no headers. I need the words with frequencies between 5 and 300. This is the code I am using. It is not working.
wordlist = open('C:\\Python27\\bengali_wordlist_full.txt', 'r').read().decode('string-escape').decode("utf-8")
for word in wordlist:
    if word[1] >= 3
        print(word[0])
    elif word[1] <= 300
        print(word[0])
This is giving me a syntax error.
  File "<stdin>", line 2
    if word[1] >= 3
                   ^
SyntaxError: invalid syntax
Can anyone please help?
You should add : after your if statements to fix this SyntaxError:
wordlist = open('C:\\Python27\\bengali_wordlist_full.txt', 'r').read().decode('string-escape').decode("utf-8")
for word in wordlist:
    if word[1] >= 3:
        print word[0]
    elif word[1] <= 300:
        print word[0]
Read this:
https://docs.python.org/2/tutorial/controlflow.html
Also, here is one useful tip: when Python reports a SyntaxError for some line, always look at the previous line as well, and then at the following one.
There are a few problems with your code; I will add a full explanation in an hour or so. In the meantime, see how it should look and consult the docs:
First, it is safer to use a with open() block for opening files (see https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects)
filepath = 'C:/Python27/bengali_wordlist_full.txt'
with open(filepath) as f:
    content = f.read().decode('string-escape').decode("utf-8")
    # do you really need all of this decoding?
Now content holds the text from the file: it is one long string, with '\n' characters marking the line endings. We can split it into a list of lines:
lines = content.splitlines()
and parse one line at a time:
for line in lines:
    try:
        # split line into items, assign first to 'word', second to 'freq'
        word, freq = line.split('\t') # assuming you have tab as separator
        freq = float(freq) # we need to convert second item to numeric value from string
        if 5 <= freq <= 300: # you can 'chain' comparisons like this
            print word
    except ValueError:
        # this happens if split() gives more than two items or float() fails
        print "Could not parse this line:", line
        continue
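The answers above target Python 2. For readers on Python 3, the same parsing logic might be sketched as follows. The function name words_in_range and the sample lines are assumptions made for illustration, and a tab separator is assumed as in the answer; when reading the real file, passing encoding="utf-8" to open() should replace the explicit decode calls.

```python
def words_in_range(lines, low=5, high=300):
    """Return words whose frequency lies between low and high, inclusive."""
    selected = []
    for line in lines:
        try:
            # Assumes exactly two tab-separated columns: word, frequency.
            word, freq = line.rstrip("\n").split("\t")
            freq = float(freq)
        except ValueError:
            # Wrong number of columns or a non-numeric frequency: skip the line.
            continue
        if low <= freq <= high:
            selected.append(word)
    return selected


# Hypothetical sample lines standing in for the word list file.
sample = ["the\t1000\n", "bhalo\t42\n", "rare\t2\n", "broken line\n", "edge\t5\n"]
print(words_in_range(sample))
```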

Python Anagrams recursion [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 months ago.
I have this code:
mylist = open('sortedwords.txt')
txt = mylist.read()
mylist = txt.split()
stuff = input('Type a word here: ')
def removeletters (word, Analysis):
    for char in range (len(Analysis)):
        if Analysis [char] in word:
            word = word.replace(Analysis[char],"",1)
    return word
def anagramSubset(word, textList):
    newWord = word
    for char in range(len(textList)):
        if textList[char] not in newWord:
            return False
        else:
            newWord = newWord.replace(textList[char],"",1)
    return True
def anagram(word, textList):
    savedWords =[]
    for checkword in textList:
        if len(word) == len(checkword) and anagramSubset(word, checkword):
            savedWords.append(checkword)
            print(checkword)
anagram(stuff, mylist)
It is supposed to take an input word, remove letters from the input word, then make a subset of words and save that to an array to print off of.
The problem is that the code will save every word that can be created from the input. E.g. an input of spot results in top, tops, stop, pots, pot, etc. The result should only have tops, pots, and stop.
What is wrong with the code, and how do I fix it?
I looked at the code and am wondering what the recursion is adding? The first pass does all of the computational work and then the recursion adds some extra stack frames and alters how output is printed. Am I making the wrong assumption that textList is a list of valid words split from a single line in a file?
When I run this locally with a particular word list, this gets the same effect (in the sense that it finds words whose letters are a subset) with less thrashing:
def anagram(word, textList):
    savedWords = []
    for checkword in textList:
        if anagramSubset(word, checkword):
            savedWords.append(checkword)
    print(savedWords)
If the problem eventually becomes that you're getting words that have too few letters, you could fix your problem by checking that a word is the length of the original word before you add it with:
if len(original_word) == len(checkword):
    savedWords.append(checkword)
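Another way to get only the full-length anagrams is to compare letter counts directly, which makes the separate length check unnecessary. This swaps in a different technique from the subset test above; the function name find_anagrams and the sample word list are hypothetical.

```python
from collections import Counter


def find_anagrams(word, candidates):
    """Return the candidates that use exactly the same letters as word."""
    target = Counter(word)
    # Two words are anagrams when every letter occurs the same number of times.
    return [candidate for candidate in candidates if Counter(candidate) == target]


# Hypothetical word list standing in for sortedwords.txt.
words = ["top", "tops", "stop", "pots", "pot"]
print(find_anagrams("spot", words))
```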

Counting three letter acronyms in a line with Regex Python [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
I need to make a program in python which looks through a given file. Let's say acronyms.txt, and then returns a percentage value of how many lines contain at least 1 three letter acronym.
For example:
NSW is a very large state.
It's bigger than TAS.
but WA is the biggest!
After reading this it should return 66.7% as 66.7% of the lines contain a three letter acronym. It is also rounded to the first decimal place as you can see. I am not very familiar with regex but I think it would be simplest with regex.
EDIT:
I have finished the code, but I need it to recognize acronyms with dots between the letters, e.g. N.S.W should be recognized as an acronym. How do I do this?
Any help would be appreciated!
You can do:
import re
cnt = 0
with open('acronyms.txt') as myfile:
    lines = myfile.readlines()
length = len(lines)
for line in lines:
    if re.search(r'\b[A-Z]{3}\b', line) is not None:
        cnt += 1
print("{:.1f}%".format(cnt/length*100))
r'\b[A-Z]{3}\b' matches three (and only three) capital letters in a row, bounded by word boundaries. If a match is found, we increment the count.
Then we simply do the count divided by the length of lines, and print the result as you have shown.
You can do something like:
import re
total_lines = 0
matched_lines = 0
for line in open("filename"):
    total_lines += 1
    matched_lines += bool(re.search(r"\b[A-Z]{3}\b", line))
print("%.1f%%" % (float(matched_lines) / total_lines * 100))
Note the '\b' in the search pattern: it matches the empty string at the beginning or end of a word. It prevents unwanted matches on acronyms longer than three letters ('asdf ASDF asdf') and on capital letters inside a word ('asdfASDasdf').
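Neither answer covers the dotted form from the question's edit (N.S.W). One possible extension is to add a second alternative to the pattern, as sketched below. The function name acronym_line_percentage is hypothetical, and the pattern is an assumption about what should count: it would still match the first three letters of a longer dotted sequence such as U.S.A.F, so it may need tightening for real data.

```python
import re

# Either three capitals in a row, or three capitals separated by dots.
ACRONYM = re.compile(r'\b(?:[A-Z]{3}|[A-Z]\.[A-Z]\.[A-Z])\b')


def acronym_line_percentage(lines):
    """Return the percentage of lines containing at least one three-letter acronym."""
    lines = list(lines)
    if not lines:
        return 0.0
    matched = sum(1 for line in lines if ACRONYM.search(line))
    return matched / len(lines) * 100


sample = [
    "N.S.W is a very large state.",
    "It's bigger than TAS.",
    "but WA is the biggest!",
]
print("{:.1f}%".format(acronym_line_percentage(sample)))
```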
