Python Grabbing String in between characters - python

If I have a string like /Hello how are you/, how can I find this text in a line and delete it with a Python script?
import sys
import re
i_file = sys.argv[1]
def stripwhite(text):
    # Remove whitespace everywhere except inside double-quoted segments.
    lst = text.split('"')
    for i, item in enumerate(lst):
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)
# i_file lists one file name per line.
with open(i_file) as i_file_comment_strip:
    i_files_names = i_file_comment_strip.readlines()
for name in i_files_names:
    # Open for reading; mode "w" would truncate the file before it is read.
    with open(name.strip()) as i_file_data:
        i_file_comment = i_file_data.readlines()
    for line in i_file_comment:
        i_file_comment_data = line.strip()
i_file_comment holds the lines read from i_file_data, and some of those lines contain text in the "/.../" format. Should I loop over each character in a line and replace every one of those characters with ""?

To remove /Hello how are you/, you can use a regex:
import re
x = 'some text /Hello how are you/ some more text'
print(re.sub(r'/.*/', '', x))
Output (the two spaces that surrounded the removed text remain):
some text  some more text
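One caveat with the pattern above: .* is greedy, so on a line with several /.../ segments it removes everything from the first slash to the last one. A minimal sketch of a per-pair alternative, assuming the delimited text never contains a slash itself (the helper name strip_delimited is made up for illustration):

```python
import re

def strip_delimited(text):
    # [^/]* cannot cross a slash, so each /.../ pair is matched on its own.
    return re.sub(r"/[^/]*/", "", text)

line = "keep /foo/ this /bar/ too"
greedy = re.sub(r"/.*/", "", line)   # removes "/foo/ this /bar/" in one match
per_pair = strip_delimited(line)     # removes "/foo/" and "/bar/" separately

print(repr(greedy))
print(repr(per_pair))
```

The greedy version loses the word "this"; the per-pair version keeps it. Either pattern will also eat things like file paths, so it is worth checking what else in the input contains slashes.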

If you know you have occurrences of a fixed string in your lines, you can simply do
for line in i_file_comment:
    line = line.replace('/Hello how are you/', '')
However, if what you have is multiple occurrences of strings delimited by / (e.g. /foo/, /bar/), I think a simple regex will suffice:
>>> import re
>>> regex = re.compile(r'\/[\w\s]+\/')
>>> s = """
... Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
... /Hello how are you/ ++ tempor incididunt ut labore et dolore magna aliqua.
... /Hello world/ -- ullamco laboris nisi ut aliquip ex ea commodo
... """
>>> print(re.sub(regex, '', s))  # find substrings matching the regex, replace them with '' in string s
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
++ tempor incididunt ut labore et dolore magna aliqua.
-- ullamco laboris nisi ut aliquip ex ea commodo
>>>
Just adjust the regex to match whatever you need to get rid of :)
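Since the question is about deleting these segments from files, here is a rough sketch of applying the same pattern to a file in place. The function name strip_file and the file name example.txt are made up for illustration, and the pattern is the one from the answer above, so it only matches word characters and whitespace between the slashes:

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the answer above: word characters and whitespace between slashes.
SLASH_DELIMITED = re.compile(r"/[\w\s]+/")

def strip_file(path):
    """Rewrite the file at `path` with every /.../ segment removed."""
    path = Path(path)
    lines = path.read_text().splitlines(keepends=True)
    # Substituting per line keeps a match from spanning a line break.
    cleaned = "".join(SLASH_DELIMITED.sub("", line) for line in lines)
    path.write_text(cleaned)
    return cleaned

# Demo on a throwaway file so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.txt"
    target.write_text("/Hello how are you/ ++ tempor\nplain line\n")
    result = strip_file(target)
    print(repr(result))
```

This reads the whole file into memory, which is fine for small files; for a list of file names you would call strip_file once per name.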

Related

How would I limit the number of characters per line in python from an input

Would there be a way to limit the amount of characters that are printed per line?
while 1:
    user_message = ""
    messageQ = input("""\nDo you want to enter a message?
[1] Yes
[2] No
[>] Select an option: """)
    if messageQ == "1":
        message = True
    elif messageQ == "2":
        message = False
    else:
        continue
    if message == True:
        print(
            """
-----------------------------------------------------------------
You can enter a custom message that is below 50 characters.
""")
        custom_message = input("""\nPlease enter your custom message:\n \n> """)
        if len(custom_message) > 50:
            print("[!] Only 50 characters allowed")
            continue
        else:
            print(f"""
Your Custom message is:
{custom_message}""") #here is where I need to limit the number of characters per line to 25
    break
So where I print it here:
Your Custom message is:
{custom_message}""") #here is where I need to limit the number of characters per line to 25
I need to limit the output to 25 characters per line.
You can put a precision in the format spec, which truncates the string to at most 25 characters:
message = "More than 25 characters in this message!"
print(f"{message:.25}")
Output
More than 25 characters i
You can use textwrap.fill to break an overly long string into lines. Example usage:
import textwrap
message = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
print(textwrap.fill(message, 25))
Output
Lorem ipsum dolor sit
amet, consectetur
adipiscing elit, sed do
eiusmod tempor incididunt
ut labore et dolore magna
aliqua. Ut enim ad minim
veniam, quis nostrud
exercitation ullamco
laboris nisi ut aliquip
ex ea commodo consequat.
Duis aute irure dolor in
reprehenderit in
voluptate velit esse
cillum dolore eu fugiat
nulla pariatur. Excepteur
sint occaecat cupidatat
non proident, sunt in
culpa qui officia
deserunt mollit anim id
est laborum.
>>> my_str = """This is a really long message that is longer than 25 characters"""
>>> # For 25 characters TOTAL
>>> print(f"This is your custom message: {my_str}"[:25])
This is your custom messa
>>> # For 25 characters in custom message
>>> print(f"This is your custom message: {my_str[:25]}")
This is your custom message: This is a really long mes
This uses slicing: [:25] keeps the first 25 characters and drops everything after them.
Since you have already checked that the message is no longer than 50 characters, you only need to know whether it is longer than 25 characters. If it is, print it as two slices:
if len(custom_message) <= 25:
    print(custom_message)
else:
    print(custom_message[:25])  # first 25 characters
    print(custom_message[25:])  # the rest, at most 25 more
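The slicing and textwrap approaches behave differently: slicing cuts at exactly 25 characters even in the middle of a word, while textwrap breaks at whitespace. A small sketch comparing the two, where hard_wrap is a made-up helper name and the sample message is invented for illustration:

```python
import textwrap

def hard_wrap(message, width=25):
    """Split message into consecutive slices of at most `width` characters."""
    return [message[i:i + width] for i in range(0, len(message), width)]

msg = "You can enter a custom message below 50 chars."

print("Hard slices:")
for line in hard_wrap(msg):
    print(line)

print("Word-aware wrap:")
for line in textwrap.wrap(msg, width=25):
    print(line)
```

hard_wrap works for any length, so it does not depend on the 50-character limit checked earlier in the loop.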

Struggling with reading a text file and use of nested loops

I have attempted making a program that counts the number of occurrences of "[AB]" in a text file by searching each file individually (after loading and opening the file of course) but it doesn't seem to work, and I have no idea why.
Here is the program:
# NOTE: to make it work try making more functions that return values and check if
# for the beginning and end of the names
# to deal with the issue of local variable scope
#imports and reads first line of text file
print("Opening and closing file")
print("\nReading characters from file.")
text_file = open("chat3.txt", "r")
#prints current line just for checking(can remove later)
x = 0
ABcount = 0
d = 0
length = len(text_file.readlines())
print("There are no of lines ", length)
line = text_file.readline()
print("the current line is ", line)
#loop to find most commonly used words( a tuple with word(string): no of occurences(int))
print("point 1(before loop 1)")
for d in range(0, length):
    print("point 2(just into loop 1)")
    c = text_file.readline()#reads one line and stores it in variable c as a string
    count = len(c)#gets the length of line/no of characters in it as the next loop will iterate for each one
    print(c)
    print("point 3(in loop 1 after printing current line)")
    for x in range(0, count):
        print("This is count number", x+1)
        c2 = c[x]
        print("Current char is ", c2)
        if(('[' in c) and (c2 == '[')):
            start = c.index('[') + 1
            end = c.index(':')
            ABcount += 1
            print("There is/are ", ABcount, c[start:end])
        elif ( not '[' in c):
            break
text_file.close()
The contents of chat3.txt are:
nn an an [AB:2020]
[AB]
[AB]
And the output from running it is:
PS C:\Users\test> python counter.py
Opening and closing file
Reading characters from file.
There are no of lines 3
the current line is
point 1(before loop 1)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
point 2(just into loop 1)
point 3(in loop 1 after printing current line)
PS C:\Users\test>
Use a regex for this kind of thing.
t.txt
Deserunt velit ipsum quis id aliquip commodo deserunt nulla officia ea dolor reprehenderit pariatur. Sit laboris culpa in non et. Do laborum aliqua sunt voluptate occaecat anim magna eu. Est tempor ad non consectetur ea reprehenderit est quis et. Culpa eu sit amet est ullamco eiusmod et sit excepteur et cupidatat ullamco consectetur Lorem. Dolore elit dolore proident consectetur ipsum non. Sunt veniam incididunt duis veniam dolor sunt fugiat irure eiusmod.
Nulla eiusmod voluptate aute tempor amet aliquip ad culpa dolor labore consequat ut ea proident. Qui minim velit elit ut excepteur fugiat nisi esse do et sit. Consequat est pariatur officia incididunt et pariatur laborum aute veniam do adipisicing.
Eu aliqua ex ex irure. Mollit adipisicing est id quis eiusmod aliqua ullamco cupidatat. Lorem ea esse magna aliqua aute occaecat. Velit in enim ut ad eu magna amet fugiat labore amet ea.
Adipisicing duis enim tempor ipsum magna duis. Consectetur ullamco adipisicing est aute fugiat qui excepteur nostrud nisi laboris ipsum. Officia sunt eiusmod consectetur dolor do et adipisicing duis cillum. Adipisicing esse exercitation deserunt labore Lorem deserunt consectetur ad laboris anim sit veniam ex ea. Minim voluptate pariatur dolor adipisicing commodo voluptate consectetur aute id officia irure elit. Cillum eiusmod esse nulla enim nostrud mollit voluptate incididunt ullamco anim cillum officia.
script
import re
with open('t.txt', 'r') as file:
    f = file.read()
print(re.findall('ab', f))
# ['ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab']
To answer your question: the inner loop never runs because your first call to readlines() leaves the file position at the end of the file, so every later readline() returns an empty string. This might help: Why the second time I run "readlines" on the same file nothing is returned?
If you want to loop over a file line by line, just use for line in file:
As for the rest, as other answers suggest, there are certainly better ways to do this, but I believe that is not the question here.
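To make the cursor behaviour concrete, here is a minimal sketch that counts the marker in a single pass over the lines. io.StringIO stands in for open("chat3.txt") so the example runs without the file, and count_marker is a made-up helper name; counting the substring "[AB" is an assumption about what should be counted.

```python
import io

def count_marker(lines, marker="[AB"):
    """Count occurrences of marker across an iterable of lines."""
    total = 0
    for line in lines:  # a file object yields one line per iteration
        total += line.count(marker)
    return total

# Stand-in for open("chat3.txt"); the contents mirror the question's file.
sample = io.StringIO("nn an an [AB:2020]\n[AB]\n[AB]\n")

first_pass = count_marker(sample)   # reads to the end of the "file"
second_pass = count_marker(sample)  # cursor is already at the end
sample.seek(0)                      # rewind to the start
after_seek = count_marker(sample)

print(first_pass, second_pass, after_seek)
```

The second pass finds nothing for the same reason the original program does: the data has already been consumed, and only seek(0) (or reopening the file) makes it readable again.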

RMarkdown: knitr::purl() on Python code chunk?

I want to export my Python code chunk in RMarkdown to an external file. knitr::purl() achieves this, but I am only able to make it work on R code chunks. Does it not work for any other language than R?
For example, from below, export the python code into a my_script.py file.
---
title: "Untitled"
output: html_document
---
## Header
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip
```{python}
x = 10
y = 20
z = x + y
print(z)
```
Currently purl() writes non-R code commented out, so we need to override the function that produces that output.
Here is a simple script that (1) outputs python code only, and (2) strips documentation (I took the function from knitr source and hacked it):
library("knitr")
# New processing functions
process_tangle <- function (x) {
  UseMethod("process_tangle", x)
}
process_tangle.block <- function (x) {
  params = opts_chunk$merge(x$params)
  # Suppress any code but python
  if (params$engine != 'python') {
    params$purl <- FALSE
  }
  if (isFALSE(params$purl))
    return("")
  label = params$label
  ev = params$eval
  code = if (!isFALSE(ev) && !is.null(params$child)) {
    cmds = lapply(sc_split(params$child), knit_child)
    one_string(unlist(cmds))
  }
  else knit_code$get(label)
  if (!isFALSE(ev) && length(code) && any(grepl("read_chunk\\(.+\\)",
                                                code))) {
    eval(parse_only(unlist(stringr::str_extract_all(code,
                                                    "read_chunk\\(([^)]+)\\)"))))
  }
  code = knitr:::parse_chunk(code)
  if (isFALSE(ev))
    code = knitr:::comment_out(code, params$comment, newline = FALSE)
  # Output only the code, no documentation
  return(knitr:::one_string(code))
}
# Reassign functions
assignInNamespace("process_tangle.block",
                  process_tangle.block,
                  ns="knitr")
# Purl
purl("tmp.Rmd", output="tmp.py")
Here is my tmp.Rmd file. Note that it has an R chunk, which I do not want in the result:
---
title: "Untitled"
output: html_document
---
## Header
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip
```{python}
#!/usr/bin/env python
# A python script
```
```{python}
x = 10
y = 20
z = x + y
print(z)
```
```{r}
y=5
y
```
Running Rscript extract.R I get tmp.py:
#!/usr/bin/env python
# A python script
x = 10
y = 20
z = x + y
print(z)
PS I found this question searching for the solution to the same problem. Since nobody answered it, I developed my own solution :)
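If patching knitr internals feels fragile, a different route is to skip purl entirely and pull the chunks out with a few lines of Python. This is a rough sketch, not knitr's behaviour: it only recognises chunks whose header starts with {python, it ignores chunk options such as purl=FALSE or eval=FALSE, and the function name extract_python_chunks is made up for illustration.

```python
FENCE = "`" * 3  # built this way so the example contains no literal fence line

def extract_python_chunks(rmd_text):
    """Return the bodies of all python chunks in an R Markdown document."""
    kept = []
    in_python = False
    in_other = False
    for line in rmd_text.splitlines():
        stripped = line.strip()
        if not in_python and not in_other:
            if stripped.startswith(FENCE + "{python"):
                in_python = True          # opening of a python chunk
            elif stripped.startswith(FENCE):
                in_other = True           # some other chunk or plain fence
        elif stripped == FENCE:
            in_python = in_other = False  # closing fence of any chunk
        elif in_python:
            kept.append(line)
    return "".join(line + "\n" for line in kept)

rmd = "\n".join([
    "## Header",
    FENCE + "{python}",
    "x = 10",
    "print(x)",
    FENCE,
    FENCE + "{r}",
    "y=5",
    FENCE,
    "",
])
script = extract_python_chunks(rmd)
print(script)
```

Writing the returned string to my_script.py gives a file with only the Python code; prose and R chunks are dropped.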

Python common word finder

I have a small program that looks at a text file and displays how many times each word was used. Instead of printing words, it prints the most commonly used letters, and I don't understand what the problem is.
import re
from collections import Counter
words = re.findall(r'\w', open('words.txt').read().lower())
count = Counter(words).most_common(8)
print(count)
I hope this helps; this is a regular expression answer that goes word by word.
import re
with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\w+', line):
            # word by word
            print(word)
If you do not have quotes around your data and just want one word at a time (ignoring the difference between spaces and line breaks in the file), try this:
with open('words.txt','r') as f:
    for line in f:
        for word in line.split():
            print(word)
import string
from collections import Counter
words = open('words.txt').read().lower()
# skip punctuation
words = words.translate(str.maketrans('', '', string.punctuation)).split()
count = Counter(words).most_common(8)
In a regex, \w matches a single word character (a letter, digit, or underscore), not a whole word. You can get a list of words with:
words = open('words.txt').read().lower().split()
And then do what you were already doing:
count = Counter(words).most_common(8)
print(count)
I guess that should suffice; tell me if it isn't working.
Assuming you have the following text file:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
And you want to calculate word frequencies:
import operator
with open('text.txt') as f:
    words = f.read().split()
result = {}
for word in words:
    result[word] = words.count(word)
result = sorted(result.items(), key=operator.itemgetter(1), reverse=True)
print(result)
You'll get a list of words with the number of occurrences of each, sorted in descending order:
[('in', 3), ('dolor', 2), ('ut', 2), ('dolore', 2), ('Lorem', 1),
('ipsum', 1), ...
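Calling words.count(word) inside the loop rescans the whole list for every word, which gets slow on large files. A minimal sketch of the same count in one pass, combining the \w+ pattern with the Counter the question already uses; the function name most_common_words and the sample sentence are made up for illustration:

```python
import re
from collections import Counter

def most_common_words(text, n=8):
    """Return the n most frequent words in text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())  # \w+ matches whole words, \w single characters
    return Counter(words).most_common(n)

sample = "The cat and the hat, and the bat."
top = most_common_words(sample, 2)
print(top)  # [('the', 3), ('and', 2)]
```

For a file, pass open('words.txt').read() as the text argument.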

get words from large file, using low memory in python

I need to iterate over the words in a file. The file could be very big (over 1TB), the lines could be very long (maybe just one line). Words are English, so reasonable in size. So I don't want to load in the whole file or even a whole line.
I have some code that works, but it may blow up if lines are too long (longer than ~3GB on my machine).
def words(file):
    for line in file:
        words=re.split(r"\W+", line)
        for w in words:
            word=w.lower()
            if word != '': yield word
Can you tell me how I can rewrite this iterator function, simply, so that it does not hold more in memory than it needs?
Don't read line by line, read in buffered chunks instead:
import re
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        for word in (w.lower() for w in words if w):
            yield word
    if buffer:
        yield buffer.lower()
I'm using the callable-and-sentinel version of the iter() function to handle reading from the file until file.read() returns an empty string; I prefer this form over a while loop.
If you are using Python 3.3 or newer, you can use generator delegation here:
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        yield from (w.lower() for w in words if w)
    if buffer:
        yield buffer.lower()
Demo using a small chunk size to demonstrate this all works as expected:
>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
...
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur
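The callable-and-sentinel form of iter() used in the answer can be seen in isolation in the sketch below. io.StringIO stands in for a real text file, and read_chunks is a made-up helper name:

```python
import io

def read_chunks(file, size):
    """Return an iterator over chunks of at most `size` characters."""
    # iter(callable, sentinel) calls file.read(size) repeatedly and stops
    # as soon as a call returns the sentinel '' (what read() gives at EOF).
    return iter(lambda: file.read(size), '')

chunks = list(read_chunks(io.StringIO("Lorem ipsum dolor"), 6))
print(chunks)  # ['Lorem ', 'ipsum ', 'dolor']
```

Only one chunk is held at a time, which is why memory use stays bounded regardless of line length. For a file opened in binary mode the sentinel would need to be b'' instead of ''.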
