Extracting characters from text file

Extracting characters from text file - python

I have a text file that contains:
The quick brown fox jumps over the lazy dog.
I want to extract the characters at even positions, starting at zero, and build a string from them, like string_even = Teqikbonfxjmsoe h aydg
as well as the characters at odd positions, like string_odd = h uc rw o up vrtelz o.
I am just learning how to read text files and do not know how to approach this problem.

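# txt holds the contents of the file as a single string
print(txt[0::2])  # characters at even positions (0, 2, 4, ...)
print(txt[1::2])  # characters at odd positions (1, 3, 5, ...)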
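For completeness, here is a minimal sketch of the whole task, reading included. The helper names (read_text, split_even_odd) and the path argument are made up for illustration. It assumes the file holds a single line and strips the trailing newline so that it does not end up in either string.

```python
def read_text(path):
    """Read a whole text file into one string, without the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return f.read().rstrip("\n")


def split_even_odd(txt):
    """Return (characters at even indexes, characters at odd indexes)."""
    return txt[0::2], txt[1::2]


# The sentence from the question, used directly instead of a file.
sample = "The quick brown fox jumps over the lazy dog."
string_even, string_odd = split_even_odd(sample)
print(string_even)  # Teqikbonfxjmsoe h aydg
print(string_odd)   # h uc rw o up vrtelz o.
```

With a real file, the call would look like split_even_odd(read_text("your_file.txt")), where the file name is a placeholder.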

Related

Remove space delimited single characters

I have texts that look like this:
the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト jumps over the lazy dog 跳過懶狗 best wishes : John Doe
What's a good regex (for python) that can remove the single-characters so that the output looks like this:
the quick brown fox 狐狸 jumps over the lazy dog 跳過懶狗 best wishes John Doe
I've tried some combinations of \s{1}\S{1}\s{1}\S{1}, but they inevitably end up removing more letters than I need.

You can replace the following with empty string:
(?<!\S)\S(?!\S).?
This matches a non-space character that has no non-space character on either side of it (i.e. it is surrounded by whitespace or by the start/end of the string), plus the character after it (if any).
I used negative lookarounds because they neatly handle the start/end-of-string case. The extra character matched after the \S is there so that the following space is removed as well.
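As a rough illustration, the pattern can be applied with re.sub. The names SINGLE_CHAR and remove_single_chars are only for this sketch. The input is the sample text from the question.

```python
import re

# A lone non-space character (whitespace or a string boundary on both sides),
# plus the one character that follows it, if there is one.
SINGLE_CHAR = re.compile(r'(?<!\S)\S(?!\S).?')


def remove_single_chars(text):
    """Delete space-delimited single characters from text."""
    return SINGLE_CHAR.sub('', text)


s = ('the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト '
     'jumps over the lazy dog 跳過懶狗 best wishes : John Doe')
print(remove_single_chars(s))
# the quick brown fox 狐狸 jumps over the lazy dog 跳過懶狗 best wishes John Doe
```

A single character at the very end of the string is removed too, but the space before it stays, so a final .strip() may be wanted.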

A non-regex version might look like:
source_string = "this is a string I created"
# keep only the whitespace-separated tokens that are longer than one character
modified_string = ' '.join(x for x in source_string.split() if len(x) > 1)
print(modified_string)

Try the code below. It uses a regex that matches only runs of two or more word characters, so single characters are dropped. Note that punctuation such as the colon is dropped as well.
import re
s='the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト jumps over the lazy dog 跳過懶狗 best wishes : John Doe'
output = re.findall(r'\w{2,}', s)
output = ' '.join(output)
print(output)

Regular expression for returning lines of dialogue

I am a beginner Python coder, so bear with me!
A line of complete dialog is defined as text that starts on its own line and starts and ends with double quotation marks (i.e. ").
What I have so far is:
def q_4():
    pattern = r'^\"\w*\"'
    return re.compile(pattern, re.M|re.IGNORECASE)
but for some reason it only returns one instance, with one word between the two double quotes. How can I go about capturing full lines?

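Try searching on the pattern \"[^"]+\", which matches an opening quote, one or more non-quote characters (newlines included), and a closing quote:
import re
inp = """Here is a quote: "the quick brown fox jumps over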
the lazy dog" and here is another "blah
blah blah" the end"""
dialogs = re.findall(r'\"([^"]+)\"', inp)
print(dialogs)
This prints:
['the quick brown fox jumps over\nthe lazy dog', 'blah\nblah blah']
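The pattern in the question finds only single words because \w does not match spaces or punctuation. If a dialog line must also sit on a line of its own, as the question defines it, one possible variation is to anchor the match with ^ and $ under re.M and to exclude newlines from the quoted text. This is a sketch and not the pattern from the answer above. The names DIALOGUE_LINE and dialogue_lines and the sample text are made up for illustration.

```python
import re

# A line that consists of nothing but one double-quoted piece of text.
DIALOGUE_LINE = re.compile(r'^"[^"\n]*"$', re.M)


def dialogue_lines(text):
    """Return every line that starts and ends with a double quote."""
    return DIALOGUE_LINE.findall(text)


sample = ('He said "hi" and left.\n'
          '"Where are you going?"\n'
          '"Out."\n'
          'She replied "fine"')
print(dialogue_lines(sample))
# ['"Where are you going?"', '"Out."']
```

Quoted text in the middle of a line is skipped because ^ only matches at the start of a line when re.M is set.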

Cropping out a portion of a string and printing using regex

I am trying to crop a portion of a list of strings and print them. The data looks like the following -
Books are on the table\nPick them up
Pens are in the bag\nBring them
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found\nSearch them
Laptops, headphones are lost\nSearch for them
(These are just a few of the 100 lines of data in the file.)
I have to keep only the part of the string before the \n in lines 1, 2, 5 and 6 and print it. I also have to print lines 3 and 4 along with them. Expected output -
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils erasers ruler cannot be found
Laptops headphones are lost
What I have tried so far -
First I replace the comma with a space - a = name.replace(',',' ');
Then I use regex to crop out the substring. My regex expression is - b = r'.*-\s([\w\s]+)\\n'. I am unable to print line 3 and 4 in which \n is not present.
The output that I am receiving now is -
Books are on the table
Pens are in the bag
Pencils erasers ruler cannot be found
Laptops headphones are lost
What should I add to my expression to print out lines 3 and 4 as well?
TIA

You may match and remove the line parts starting with a combination of a backslash and n, or all punctuation (non-word and non-whitespace) chars using a re.sub:
a = re.sub(r'\\n.*|[^\w\s]+', '', a)
Details
\\n.* - a \, n, and then the rest of the line
| - or
[^\w\s]+ - 1 or more chars other than word and whitespace chars
If you need to make sure there is an uppercase letter after \n, you may add [A-Z] after n in the pattern.
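A small runnable sketch of that substitution follows. It assumes, as the pattern does, that the data contains a literal backslash followed by n rather than real line breaks. The function name clean_line and the short list of lines are only for illustration.

```python
import re


def clean_line(line):
    """Drop a literal backslash-n and whatever follows it, and strip punctuation."""
    return re.sub(r'\\n.*|[^\w\s]+', '', line)


# Raw strings keep the backslash and the n as two separate characters.
lines = [
    r'Books are on the table\nPick them up',
    'Cats are roaming around',
    r'Pencils, erasers, ruler cannot be found\nSearch them',
]
for line in lines:
    print(clean_line(line))
# Books are on the table
# Cats are roaming around
# Pencils erasers ruler cannot be found
```

Lines without \n pass through unchanged apart from the punctuation removal, which is what lets lines 3 and 4 of the question's data be printed as well.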

I know many people like to twist their minds into knots with regex, but why not:
with open('geek_lines.txt') as lines:
    for line in lines:
        # split on the literal backslash-n and keep the part before it
        print(line.rstrip().split(r'\n')[0])
Simple to write, simple to read, and it seems to produce the correct result:
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found
Laptops, headphones are lost

How to use text.split() and retain blank (empty) lines

I'm new to Python and need some help with my program. I have code that takes in an unformatted text document, does some formatting (sets the page width and the margins), and outputs a new text document. All of my code works fine except for this function, which produces the final output.
Here is the problem segment of the code:
def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0
    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''
        foundmargin = -1
        for i in range(segment[0], segment[1]+1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')
        words = text.split()
Note: in case you are wondering about the range, segment[0] means the beginning of the document and segment[1] means the end of the document. My problem is that when I copy text to words (in words = text.split()), my blank lines are not retained. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.

There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!

First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
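The split-then-rejoin round trip could look roughly like the sketch below. Using textwrap.fill to re-wrap each paragraph to a page width is an assumption about what the formatting step does and does not come from the question's code. The function name reflow and the sample text are made up.

```python
import re
import textwrap


def reflow(text, width=60):
    """Re-wrap each paragraph to `width` columns, keeping paragraph breaks."""
    # Paragraphs are separated by two or more newlines.
    paragraphs = re.split(r'\n\n+', text.strip())
    wrapped = []
    for paragraph in paragraphs:
        words = paragraph.split()  # split on any whitespace inside the paragraph
        wrapped.append(textwrap.fill(' '.join(words), width=width))
    # Put exactly one blank line back between paragraphs.
    return '\n\n'.join(wrapped)


sample = ("This is my substitute for pistol and ball.\n"
          "With a philosophical flourish Cato throws himself upon his sword.\n"
          "\n"
          "There now is your insular city of the Manhattoes.")
print(reflow(sample, width=40))
```

Because the words are handled per paragraph, the blank line never has to survive text.split(). It is simply added back when the paragraphs are joined.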

Use a regexp to find the words and the blank lines, rather than splitting:
import re
m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)
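To show what that token list looks like and how it might be turned back into text, here is a small sketch. The names TOKEN, tokenize and rebuild are hypothetical. Each '\n\n' token is treated as a paragraph break.

```python
import re

# Either a run of non-space characters (a word) or a pair of newlines (a blank line).
TOKEN = re.compile(r'\S+|\n\n')


def tokenize(text):
    """Return words and '\\n\\n' markers in the order they appear."""
    return TOKEN.findall(text)


def rebuild(tokens):
    """Join words with spaces and start a new paragraph at each '\\n\\n' marker."""
    paragraphs = [[]]
    for tok in tokens:
        if tok == '\n\n':
            paragraphs.append([])
        else:
            paragraphs[-1].append(tok)
    # Skip empty paragraphs produced by consecutive markers.
    return '\n\n'.join(' '.join(p) for p in paragraphs if p)


tokens = tokenize("a b\n\nc d")
print(tokens)           # ['a', 'b', '\n\n', 'c', 'd']
print(rebuild(tokens))
```

A run of four newlines yields two markers in a row, which is why rebuild skips empty paragraphs.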

Splitting a text file with conditional operators in Python

I have a huge file containing a very long transcription of speech, about two days' worth. Over 100,000 words, I guess.
During transcription, I separated speakers and sessions into blocks with "<-- Name -->" marks. Is it possible to automatically split them into files following the naming convention name_speach.txt?
Thanks!
Test cases:
Test case
<--测试0-->
这个是一段测试内容，a quick fox jumps over a lazy dog.
<——测试1——>
，a quick fox just over 啊 辣子 dog!！？是吗？
<——测试2——>
这是一段测试用的text，嗯！
<--Test case 3-->
/* sound track lost #153:12.236 -- 153.18.222 */
…
A quick fox jumps over a {lazy|lame} dog.

So you want to find every "<-- Name -->" tag in a text file (100,000 words is not much for a computer's memory, I think).
You can use a regular expression to search for the tags.
In Python, it's something like:
import re
NAMETAG = r'\<\-\- (?P<name>.*?) \-\-\>'
# yourtext is the whole transcript as one string
# find all name tags in your string (finditer yields match objects with offsets)
matches = re.finditer(NAMETAG, yourtext)
offset_start_list = []
offset_end_list = []
name_list = []
for m in matches:
    name = m.group('name')
    name_list.append(name)
    # the content starts after the name tag (skipping the newline that follows it)
    offset_start_list.append(m.end() + 1)
    # the previous block's content ends where this tag starts
    offset_end_list.append(m.start())
# the first tag does not end any block; the last block ends at the end of the text
offset_end_list.pop(0)
offset_end_list.append(len(yourtext))
for name, start, end in zip(name_list, offset_start_list, offset_end_list):
    content = yourtext[start:end]
    # save your files here
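A more complete sketch of the same idea, including the file-writing step, might look like this. Several things here are assumptions and not part of the answer above: the tag pattern is loosened so that the spaces around the name are optional, it also accepts tags written with two long dashes (assumed to be U+2014), and the helper names split_blocks and write_blocks are made up. The name_speach.txt naming follows the question.

```python
import re
from pathlib import Path

# "<", two dashes (ASCII hyphens or, as an assumption, U+2014), the name, two dashes, ">".
TAG = re.compile(r'<[-\u2014]{2}\s*(?P<name>.+?)\s*[-\u2014]{2}>')


def split_blocks(text):
    """Return a list of (name, content) pairs, one per name tag."""
    matches = list(TAG.finditer(text))
    blocks = []
    for i, m in enumerate(matches):
        # A block runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        blocks.append((m.group('name'), text[m.end():end].strip()))
    return blocks


def write_blocks(text, out_dir='.'):
    """Write each block to <out_dir>/<name>_speach.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, content in split_blocks(text):
        # Append, so a speaker who appears in several blocks ends up in one file.
        with open(out / f'{name}_speach.txt', 'a', encoding='utf-8') as f:
            f.write(content + '\n')


sample = "<--Alice-->\nhello there\n<-- Bob -->\nhi -- really\n"
print(split_blocks(sample))
# [('Alice', 'hello there'), ('Bob', 'hi -- really')]
```

Names are used as file names unchanged, so a name containing a character such as "/" would need to be sanitized first.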
