Find second occurrence of newline before a "-" - python

So some sample texts are this:
Greece: Rare
Athens
Patras
------
Italy: Unique
Milan
------
and I want to get the whole text between the second occurrence of a newline before the "-" and the "-".
Expected output:
Patras
Milan
Is this possible with regex, or should I try something else?

Just search for the line before the dashes:
import re
text="""Greece: Rare
Athens
Patras
------
"""
print(re.search("(.*)\n-+",text).group(1))
prints
Patras
Note that the (.*) group matches only that line and not the previous lines, because . doesn't match \n by default.
Without regex, this can be done by looking at the index of the dashed line, and printing the previous line.
lines = text.splitlines()
index = next(i for i,x in enumerate(lines) if x.startswith("-"))
print(lines[index-1])
I'd go for the regex solution though.
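If the text contains several dashed blocks, as in the question's sample, the same pattern extends naturally to re.findall, which returns every captured line that sits directly above a run of dashes - a minimal sketch:
import re
text = """Greece: Rare
Athens
Patras
------
Italy: Unique
Milan
------
"""
# each match captures the single line directly above the dashes
print(re.findall(r"(.*)\n-+", text))  # ['Patras', 'Milan']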

This is a solution:
import re
texts=["""Greece: Rare
Athens
Patras
------
""","""Italy: Unique
Milan
------"""]
for text in texts:
    print(re.search("\n(.*)\n[-]", text).group(1))
Output:
Patras
Milan

Related

replace punctuation with space in text

I have a text like this: "Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office."
I removed the punctuation from this text with the following code.
import string

def remove_punctuation(text):
    # string.punctuation lists the characters to strip
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# storing the punctuation-free text
df_Train['BULLET_POINTS']= df_Train['BULLET_POINTS'].apply(lambda x:remove_punctuation(x))
df_Train.head()
Here, in the above code, df_Train is a pandas DataFrame whose "BULLET_POINTS" column contains the kind of text data mentioned above.
The result I got is "Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office".
Notice how the two words Eksioglu and Handsome got combined because there was no space after the comma. I need a way to overcome this issue.
In this case, it makes sense to replace all the special chars with a space, then strip the result and shrink multiple spaces to a single space:
df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
Or, if you have chunks of punctuation + whitespace to handle:
df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()
Output:
>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0 Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet Ideal for home or office
Name: BULLET_POINTS, dtype: object
The (?:[^\w\s]|_)+ regex matches one or more chars that are either underscores or neither word nor whitespace chars (i.e. one or more non-alphanumeric, non-space chars), and each such run is replaced with a single space.
The [\W_]+ pattern is similar but also matches whitespace chars.
The .str.strip() part is necessary as the replacement might result in leading/trailing spaces.
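Outside pandas, the same kind of substitution can be applied directly to a string with re.sub; a minimal sketch using the second pattern on the sample sentence:
import re
text = ("Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,"
        "Handsome cello wrapped hard magnet, Ideal for home or office.")
# collapse every run of non-alphanumeric chars (punctuation and spaces) into one space
print(re.sub(r'[\W_]+', ' ', text).strip())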

SAS/Python: Find any spaces followed by a non-space string and replace space with a different value

I have data that looks like this:
1937 Paredes 3-1
1939 Suazo 2-0
1941 Fernandez 4-0
1944 Wilchez 2-1
…
2017 Miralles 5-7
I want to read each line as a line of text, find every instance of a space followed by a number, letter, or any other non-space symbol, and replace that space with a "|", in the following manner:
1937 |Paredes |3-1
1939 |Suazo |2-0
1941 |Fernandez |4-0
1944 |Wilchez |2-1
...
2017 |Miralles |5-7
Any idea how to do this in SAS or Python?
You might use re.sub, matching a space and asserting a non-whitespace char to the right:
import re
test_str = ("1937 Paredes 3-1\n\n"
            "1939 Suazo 2-0\n\n"
            "1941 Fernandez 4-0\n\n"
            "1944 Wilchez 2-1")
result = re.sub(r" (?=\S)", "|", test_str)
if result:
    print(result)
Output
1937|Paredes|3-1
1939|Suazo|2-0
1941|Fernandez|4-0
1944|Wilchez|2-1
Or match one or more whitespace chars excluding newlines:
result = re.sub(r"[^\S\r\n]+(?=\S)", "|", test_str)
I don't understand the need to preserve the other spaces. Why not just remove them all?
data _null_;
infile 'have.txt' truncover;
file 'want.txt' dsd dlm='|';
input (var1-var3) (:$100.);
put var1-var3;
run;
Results
1937|Paredes|3-1
1939|Suazo|2-0
1941|Fernandez|4-0
1944|Wilchez|2-1
2017|Miralles|5-7
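For comparison, the drop-all-the-spaces idea from the SAS step translates to a short Python sketch (a minimal example; 'have.txt' and 'want.txt' are the same placeholder file names the SAS step uses, and it assumes each line holds exactly three whitespace-separated fields, as in the sample):
with open('have.txt') as src, open('want.txt', 'w') as dst:
    for line in src:
        fields = line.split()      # split on any run of whitespace
        if fields:                 # skip blank lines
            dst.write('|'.join(fields) + '\n')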

Trouble scooping out a certain portion of text out of a chunk

How can I get the address that appears before Telephone from the portion of text I've pasted below? I tried the following but it gives me nothing.
This is the code I've tried so far with:
import re
content="""
Campbell, Bellam Associés Inc.
3003 Rue College
Sherbrooke, QC J1M 1T8
Telephone: 819-569-9255
Website: http://www.assurancescb.com
"""
pattern = re.compile(r"(.*)(?=Telephone)")
for item in pattern.finditer(content):
    print(item.group())
Expected output:
Campbell, Bellam Associés Inc.
3003 Rue College
Sherbrooke, QC J1M 1T8
The blocks of text are always like the one pasted above, and there is no fixed marker in front of the address that I could anchor a positive lookbehind on, so I tried the lookahead approach above instead.
The dot does not match a line break character by default, so you could use the inline modifier (?s), or pass re.S or re.DOTALL:
pattern = re.compile(r"(.*)(?=Telephone)", re.S)
or
pattern = re.compile(r"(?s)(.*)(?=Telephone)")
You could also get the match without using a group:
(?s).*(?=Telephone)
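For example, a minimal sketch reusing the content string from the question, with a strip() to drop the blank lines around the address:
import re
match = re.search(r"(?s).*(?=Telephone)", content)
if match:
    # strip() removes the leading/trailing blank lines from the match
    print(match.group().strip())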
Change the line
pattern = re.compile(r"(.*)(?=Telephone)")
To
pattern = re.compile(r"(.*)(?=Telephone)", re.DOTALL)
So that the . in your regex also matches newline characters.
:)

How do I eliminate all the parentheses in a txt file?

I have a txt file (a single column, taken from Excel) of the following type:
AMANDA (LOUDLY SPEAKING)
JEFF
STEVEN (TEASINGLY)
AMANDA
DOC BRIAN GREEN
As output I want:
AMANDA
JEFF
STEVEN
AMANDA
DOC BRIAN GREEN
I tried a for loop over the whole column and then:
if str[i] == '(':
    return str.split('(')
but it's clearly not working.
Do you have any possible solution? I would then need an output file like my original txt, with each name on its own line in a single column.
Thanks everyone!
(I am using PyCharm 3.2)
I'd use a regex in this situation. \(.*?\) matches an opening parenthesis, then any characters (the *? quantifier matches 0 or more, as few as possible), then the closing parenthesis, and re.sub replaces each such match with an empty string.
import re
with open("mytext.txt", "r") as fi, open("out.txt", "w") as fo:
    for line in fi:
        # strip out every "(...)" group from the line
        fo.write(re.sub(r"\(.*?\)", "", line))
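Note that this keeps the space that preceded each parenthetical, so "AMANDA (LOUDLY SPEAKING)" becomes "AMANDA " with a trailing space. If that matters, widening the pattern so it also consumes the leading whitespace is one option - a minimal sketch:
import re
line = "AMANDA (LOUDLY SPEAKING)"
# \s* also removes the space(s) before the opening parenthesis
print(re.sub(r"\s*\(.*?\)", "", line))  # AMANDA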
You can split the string into a list using a regular expression that matches either anything in parentheses or a full word, remove all elements that contain parentheses, and then join the list back into a string. The advantage is that there will be no double spaces in the result where a word in parentheses was removed.
import re
text = "AMANDA (LOUDLY SPEAKING) JEFF STEVEN (TEASINGLY) AMANDA DOC BRIAN GREEN"
words = re.findall(r"\(.*?\)|[^\s]+", text)
print(" ".join([x for x in words if "(" not in x]))

Removing white space from txt with python

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:
B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
I'd like to remove all the extra spaces between the columns (they're actually varying numbers of spaces, not tabs) and replace them with some delimiter (tab or pipe, since there are commas within the data), like so:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:
Python Regular expression must strip whitespace except between quotes,
Remove white spaces from dict : Python.
You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.
>>> import re
>>> line = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS      '
>>> re.sub(r'\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'
Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.
Your actual code should look similar to this:
import re
with open(filename) as f:
    for line in f:
        subbed = re.sub(r'\s{2,}', '|', line.strip())
        # do something here
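For instance, the "do something here" part could simply write the cleaned lines to a new file - a minimal sketch, where filename is the same placeholder as above and output.txt is a hypothetical output name:
import re
with open(filename) as f, open('output.txt', 'w') as out:
    for line in f:
        # collapse each run of 2+ whitespace chars into a pipe and write the line out
        out.write(re.sub(r'\s{2,}', '|', line.strip()) + '\n')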
What about this?
your_string = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS'
print(re.sub(r'\s{2,}', '|', your_string.strip()))
Output:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Explanation:
I've used re.sub(), which takes three parameters: a pattern, the replacement string, and the string you want to work on.
It takes every run of at least two spaces and replaces it with a |, applied to your string.
s = """B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
"""
# Update
print(re.sub(r"(\S) {2,}(\S)(\n?)", r"\1|\2\3", s))
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Considering there are at least two spaces separating the columns, you can use this:
lines = [
    'B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON     ',
    'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
]
for line in lines:
    parts = []
    for part in line.split('  '):  # split on two consecutive spaces
        part = part.strip()
        if part:  # checking if stripped part is a non-empty string
            parts.append(part)
    print('|'.join(parts))
Output for your input:
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
It looks like your data is in a "text-table" format.
I recommend using the first row to figure out the start point and length of each column (either by hand or with a script that uses a regex to determine the likely columns), then writing a script that iterates over the rows of the file, slices each row into column segments, and applies strip to each segment.
If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.
In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.
Parsing the text above would look something like this (I nested a list comprehension with brackets instead of the traditional format so it's easier to understand):
cols = ((0, 34),
        (34, 50),
        (50, 59),
        (59, None),
        )
for line in lines:
    cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
    print(cleaned)
then you can write it with something like:
import csv
with open('output.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        spamwriter.writerow([line[col_start:col_end].strip()
                             for (col_start, col_end) in cols])
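For completeness, reading the file back could look roughly like this (a minimal sketch, using the same delimiter and quote settings as the writer above):
import csv
with open('output.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='|', quotechar='"')
    for row in reader:
        print(row)  # each row is a list of the stripped column values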
Looks like this library can solve this quite nicely:
http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery
Impressive...
