How to clean the tweets having a specific but varying length pattern? - python

I pulled out some tweets for analysis. When I separate the words in tweets I can see a lot of following expressions in my output:
\xe3\x81\x86\xe3\x81\xa1
I want to use regular expressions to replace these patterns with nothing. I am not very good with regex. I tried using solution in some similar questions but nothing worked for me. They are replacing characters like "xt" from "extra".
I am looking for something that will replace \x?? with nothing, considering ?? can be either a-f or 0-9 but word must be 4 letter and starting with \x.
Also i would like to add replacement for anything other than alphabets. Like:
"Hi!! my number is (7097868709809)."
after replacement should yield
"Hi my number is."
Input:
\xe3\x81\x86\xe3Extra
Output required:
Extra

What you are seeing is Unicode characters that can't directly be printed, expressed as pairs of hexadecimal digits. So for a more printable example:
>>> ord('a')
97
>>> hex(97)
'0x61'
>>> "\x61"
'a'
Note that what appears to be a sequence of four characters '\x61' evaluates to a single character, 'a'. Therefore:
?? can't "be anything" - they can be '0'-'9' or 'a'-'f'; and
Although e.g. r'\\x[0-9a-f]{2}' would match the sequence you see, that's not what the regex would parse - each "word" is really a single character.
You can remove the characters "other than alphabets" using e.g. string.printable:
>>> s = "foo\xe3\x81"
>>> s
'foo\xe3\x81'
>>> import string
>>> valid_chars = set(string.printable)
>>> "".join([c for c in s if c in valid_chars])
'foo'
Note that e.g. '\xe3' can be directly printed in Python 3 (it's 'ã'), but isn't included in string.printable. For more on Unicode in Python, see the docs.

Related

Regex to get all occurrences of a pattern followed by a value in a comma separate string

This is in python
Input string:
Str = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
Expected output
'FU=_COG-GAB-CANE_,FU=FARE,FU=#-_MAP.com_'
here 'FU=' is the occurence we are looking for and the value which follows FU=
return all occurrences of FU=(with the associated value for FU=) in a comma-separated string, they can occur anywhere within the string and special characters are allowed.
Here is one approach.
>>> import re
>>> str_ = 'Y=DAT,X=ZANG,FU=FAT,T=TART,FU=GEM,RO=TOP,FU=MAP,Z=TRY'
>>> re.findall.__doc__[:58]
'Return a list of all non-overlapping matches in the string'
>>> re.findall(r'FU=\w+', str_)
['FU=FAT', 'FU=GEM', 'FU=MAP']
>>> ','.join(re.findall(r'FU=\w+', str_))
'FU=FAT,FU=GEM,FU=MAP'
Got it working
Python Code
import re
str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
str2='FU='+',FU='.join(re.findall(r'FU=(.*?),', str_))
print(str2)
Gives the desired output:
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
Successfully gives me all the occurrences of FU= followed by values, irrespective of order and number of special characters.
Although a bit unclean way as I am manually adding FU= for the first occurrence.
Please suggest if there is a cleaner way of doing it ? , but yes it gets the work done.

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
import re
print (re.search(r'a!b2(.{4})')).group(1))
.{4} matches any 4 characters except special characters.
group(0) is the complete match of the searched string. You can read about group id here.
If you're only looking for how to grab the following 4 characters using Regex, what you are probably looking to use is the curly brace indicator for quantity to match: '{}'.
They go into more detail in the post here, but essentially you would do [a-Z][0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
A more Pythonic way to solve this problem would be to make use of string slicing, and the index function however.
Eg. given an input_string, when you find the substring at index i using index, then you could use input_string[i+len(sub_str):i+len(sub_str)+4] to grab those special characters.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
Outputs (to show it works with <4 as well): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.

regex python pattern error

I am using the following regular expression pattern for searching in a text file :
hexadecimal numbers (to find : 1a2bc3d4e5 or 2369.235.26.158963 or Aaa4 )
only letters "a" or spaces. There may be "a", spaces or a mixture of
two but nothing else. :
Below my regex for hexadecimal numbers :
matches = re.compile(' 0[xX][0-9a-fA-F]+ ')
Below my regex for second pattern :
matches = re.compile(r'^[a| ]*$')
Unfortunately, it does not work.
Thanks in advance for your help
Honestly, sometimes I think it's best when asking questions to include some of the actual input (or something close to it) and the desired output. For your hex numbers I'm wondering if you want to capture the 0x which precedes the value or avoid it; secondly variable length hex with your regex prototype (slightly corrected) would capture things like 'def', 'bad', etc. Anyway, having input and desired output helps with understanding the problem. The same can be said for people who answer.
With that said, for your first regex (cause I couldn't understand what you wanted for the second), I tend to prefer using "findall" cause it's more direct and yields group matching, so with the following input (presuming you know I'm creating a string in place of using the file.read() method and making my regex capture strings of more than 4 characters)
Code
import re
input = '''This is a hex number 0xAF67E49
This is NOT a hex number tgey736zde
This hex number 0xb34df49a appears in the middle of a sentence
This could be a hex number but has no letters 3689320'''
matches1 = re.findall('([0-9a-fA-F]{4,})', input)
matches2 = re.findall('0x([0-9a-fA-F]{4,})', input)
matches3 = re.findall('(0x[0-9a-fA-F]{4,})', input)
print('matches1: %s' % (str(matches1)))
print('matches2: %s' % (str(matches2)))
print('matches3: %s' % (str(matches3)))
Output
matches1: ['AF67E49', 'b34df49a', '3689320']
matches2: ['AF67E49', 'b34df49a']
matches3: ['0xAF67E49', '0xb34df49a']
Explanation
matches1 indiscriminately matches anything that is 4 or more characters and falls within the hex range. Experiment with this by changing "tgey736zde" in the input to "tgey736de"
matches2 effectively says capture any hex string of more than 4 characters preceded by 0x, ignoring the 0x
matches3 effectively says capture any hex string of more than 4 characters preceded by 0x, but include the 0x
Extra Information
To make this more effective, you might want to research how to use lookaheads as well

How to remove set of characters when a string comprise of "\" and Special characters in python

a = "\Virtual Disks\DG2_ASM04\ACTIVE"
From the above string I would like to get the part "DG2_ASM04" alone. I cannot split or strip as it has the special characters "\", "\D" and "\A" in it.
Have tried the below and can't get the desired output.
a.lstrip("\Virtual Disks\\").rstrip("\ACTIVE")
the output I have got is: 'G2_ASM04' instead of "DG2_ASM04"
Simply use slicing and escape backslash(\)
>>> a.split("\\")[-2]
'DG2_ASM04'
In your case D is also removing because it is occurring more than one time in given string (thus striping D as well). If you tweak your string then you will realize what is happening
>>> a = "\Virtual Disks\XG2_ASM04\ACTIVE"
>>> a.lstrip('\\Virtual Disks\\').rstrip("\\ACTIVE")
'XG2_ASM04'

grep prefix python strings

I'm having problems using findall in python.
I have a text such as:
the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched
So i'm trying to find all occurrences of of those alphanumeric strings in the text. the thing is I know they all have the "33e" prefix.
Right now, I have strings = re.findall(r"(33e+)+", stdout_value) but it doesn't work.
I want to be able to return 33e445a64b65, 33e5c44598e46
try this
>>> x="the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched"
>>> re.findall("33e\w+",x)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
Here's a slight variation:
>>> string = '''the name of 33e4853h45y45 is one of the 33e445a64b65 and we want all the 33e5c44598e46 to be matched'''
>>> re.findall(r"(33e[a-z0-9]+)", string)
['33e4853h45y45', '33e445a64b65', '33e5c44598e46']
Instead of matching any word characters, it will only match digits and lowercase numbers after the 33e -- that's what the [a-z0-9]+ means.
If you wanted to also match capital letters, you could replace that part with [a-zA-Z0-9]+ instead.

Categories

Resources