String slicing with delimiter changing in length - python

This is probably an easy one, but I haven't found a solution so far. The text file that I import into Python has a space-delimited structure such as
20.06.2009 05:00:00 2.6
20.06.2009 06:00:00 21.5
I want to split this into a time and a value variable. Slicing the time component is straightforward
time = ""
value = ""
for i in lines:
    time += i[0:20]
But I can't find a solution for the value component: it mostly contains 3 digits, but sometimes 4, so the number of space delimiters between time and value changes (that's why the re package doesn't work here). Any solutions?

You can use rsplit(' ', 1) to split your string at the last occurrence of a space.
So you could do:
x = '20.06.2009 05:00:00 2.6'
y = '20.06.2009 06:00:00 21.5'
items = [x, y]
value = 0
for item in items:
    value += float(item.rsplit(' ', 1)[1])
print(value)
Output
24.1
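If the goal is to keep the timestamps and the values as two separate collections rather than a running sum, the same idea can be sketched as follows. Using rsplit(None, 1) (split on any run of whitespace) instead of rsplit(' ', 1) is an assumption made here so that several spaces or tabs before the value are absorbed; the helper name split_line and the sample lines are hypothetical.

```python
def split_line(line):
    """Split 'date time value' into (timestamp string, float value)."""
    # sep=None splits on any run of whitespace, scanning from the right;
    # maxsplit=1 keeps the date and time together in the first piece.
    timestamp, raw_value = line.strip().rsplit(None, 1)
    return timestamp, float(raw_value)


# Hypothetical sample lines with a varying number of spaces before the value.
lines = [
    "20.06.2009 05:00:00     2.6",
    "20.06.2009 06:00:00    21.5",
]

times = []
values = []
for line in lines:
    timestamp, value = split_line(line)
    times.append(timestamp)
    values.append(value)

print(times)
print(values)
```

Collecting into lists avoids the string concatenation from the question and keeps each timestamp paired with its value by index.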

You can use the strip function, which removes leading and trailing whitespace:
number += float(i[21:].strip())
This also works if you have spaces at the end of the line.
There is also the .split() function, which splits a line at every whitespace character, or at whatever separator you pass to it.

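You can .split() the entire line with list1 = x.split(' '), which gives you a list. If the columns are separated by more than one space, that list also contains an empty string for each extra space, so call x.split() with no argument instead; it treats any run of whitespace as a single delimiter:
list1 = ['20.06.2009', '05:00:00', '2.6']
The value is then the 3rd element, with no stray spaces left to remove with .replace().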
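To see how an explicit ' ' separator differs from the no-argument form, here is a small sketch. The sample line, with four spaces before the value, is an assumption about what the original file looks like.

```python
line = "20.06.2009 05:00:00    2.6"  # four spaces before the value (assumed)

# An explicit ' ' separator yields one empty string per extra space.
with_separator = line.split(" ")
print(with_separator)

# No argument: any run of whitespace counts as a single delimiter.
date, time, value = line.split()
print(date, time, float(value))
```

With the no-argument form the number of spaces between the columns no longer matters, which is the situation described in the question.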

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

For example, given
Arun,Mishra,108,23,34,45,56,Mumbai
the output I want is
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',','.') to replace the comma with a dot, but it replaces all the commas, so every delimiter becomes a dot.
You can use a regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, 1)
>>> 'Arun,Mishra,108.23,34,45,56,Mumbai'
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. It has three capture groups, so the replacement r'\1.\3' can refer to them (\1 and \3 are the first and third groups). old_str is the input string, and the final 1 is the count argument, which tells re.sub to replace only the first occurrence (thus keeping 34,45).
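The effect of the count argument can be checked with a short sketch. The pattern is simplified here to two capture groups (an assumption for brevity; the comma itself does not need its own group), and count is passed by keyword.

```python
import re

old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pattern = r'(\d+),(\d+)'

# count=1: only the first digit-comma-digit pair is rewritten.
first_only = re.sub(pattern, r'\1.\2', old_str, count=1)
print(first_only)

# count=0 (the default): every non-overlapping pair is rewritten.
every_pair = re.sub(pattern, r'\1.\2', old_str)
print(every_pair)
```

Without the count limit, the pair 34,45 is rewritten as well, which is why the limit matters for this data.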
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for commas. Once the index of a comma has been found, examine the characters on either side (checking for digits). If both are digits, modify the string accordingly:
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two adjacent sets of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can take the above regex and code it into your python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)',
                 r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string using s.split(','), then join the 3rd and 4th elements with a dot, drop the leftover 4th element,
and join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
This changes every comma to a dot when the comma has a digit both before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # only touch commas with a digit on both sides (and not at either end of the string)
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex + 1)
print(txt)

How to partial split and take the first portion of string in Python?

I have a scenario where I want to split a string partially and pick up one portion of it.
Say the string could be like aloha_maui_d0_b0 or new_york_d9_b10. Note: what follows d is numeric and can be of any length.
I want to strip everything before _d*, i.e. I want only _d0_b0 or _d9_b10.
I tried the code below, but obviously it removes the split term as well.
print(("aloha_maui_d0_b0").split("_d"))
#Output is : ['aloha_maui', '0_b0']
#But Wanted : _d0_b0
Is there any other way to get that portion? Do I need to use a regexp?
How about just
stArr = "aloha_maui_d0_b0".split("_d")
st2 = '_d' + stArr[1]
This should do the trick if the string always has a '_d' in it
You can use index() to split in 2 parts:
s = 'aloha_maui_d0_b0'
idx = s.index('_d')
l = [s[:idx], s[idx:]]
# l = ['aloha_maui', '_d0_b0']
Edit: You can also use this if you have multiple _d in your string:
s = 'aloha_maui_d0_b0_d1_b1_d2_b2'
idxs = [n for n in range(len(s)) if n == 0 or s.find('_d', n) == n]
parts = [s[i:j] for i,j in zip(idxs, idxs[1:]+[None])]
# parts = ['aloha_maui', '_d0_b0', '_d1_b1', '_d2_b2']
I have two suggestions.
partition()
Use the method partition() to get a tuple containing the delimiter as one of the elements and use the + operator to get the String you want:
teste1 = 'aloha_maui_d0_b0'
partitiontest = teste1.partition('_d')
print(partitiontest)
print(partitiontest[1] + partitiontest[2])
Output:
('aloha_maui', '_d', '0_b0')
_d0_b0
The partition() method returns a tuple whose first element is what comes before the delimiter, whose second element is the delimiter itself, and whose third element is what comes after the delimiter.
The method only acts on the first occurrence of the delimiter it finds in the string, so you can't use it to split into more than 3 parts without extra work. For that, my second suggestion would be better.
replace()
Use the method replace() to insert an extra character (or characters) right before your delimiter (_d) and use these as the delimiter on the split() method.
teste2 = 'new_york_d9_b10'
replacetest = teste2.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
new_york|_d9_b10
['new_york', '_d9_b10']
Since it replaces every occurrence of _d in the string with |_d, there is no problem using it to split into more than 2 parts.
Problem?
One situation to be careful about is unwanted splits caused by _d being present in more places than anticipated.
Following the apparent logic of your examples with city names and numericals, you might have something like this:
teste3 = 'rio_de_janeiro_d3_b32'
replacetest = teste3.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
rio|_de_janeiro|_d3_b32
['rio', '_de_janeiro', '_d3_b32']
Assuming you always have the numerical on the end of the String and _d won't happen inside the numerical, rpartition() could be a solution:
rpartitiontest = teste3.rpartition('_d')
print(rpartitiontest)
print(rpartitiontest[1] + rpartitiontest[2])
Output:
('rio_de_janeiro', '_d', '3_b32')
_d3_b32
Since rpartition() starts the search on the String's end and only takes the first match to separate the terms into a tuple, you won't have to worry about the first term (city's name?) causing unexpected splits.
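The rpartition() approach can be wrapped in a small helper that also covers strings without the delimiter. This is only a sketch: the function name tail_from_last and the choice to return an empty string when _d is absent are assumptions, not part of the answer above.

```python
def tail_from_last(text, delimiter="_d"):
    """Return the part of text starting at the last delimiter, or '' if absent."""
    head, sep, tail = text.rpartition(delimiter)
    # rpartition returns ('', '', text) when the delimiter is missing,
    # so an empty sep signals "not found".
    if not sep:
        return ""
    return sep + tail


for name in ["aloha_maui_d0_b0", "new_york_d9_b10", "rio_de_janeiro_d3_b32", "no_suffix"]:
    print(name, "->", repr(tail_from_last(name)))
```

Checking sep rather than head matters because head is also empty when the string starts with the delimiter.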
Use regex's split and keep delimiters capability:
import re
patre = re.compile(r"(_d\d)")
# note the parentheses around the pattern: they are what makes split() keep the delimiter
for line in """aloha_maui_d0_b0 new_york_d9_b10""".split():
    parts = patre.split(line)
    print("\n", line)
    print(parts)
    p1, p2 = parts[0], "".join(parts[1:])
    print(p1, p2)
output:
aloha_maui_d0_b0
['aloha_maui', '_d0', '_b0']
aloha_maui _d0_b0
new_york_d9_b10
['new_york', '_d9', '_b10']
new_york _d9_b10
credit due: https://stackoverflow.com/a/15668433
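If only the trailing part is needed, a search can be used instead of a split. The sketch below assumes the suffix always has the shape _d<digits>_b<digits> and sits at the very end of the string; that layout is inferred from the examples in the question, and the names SUFFIX and extract_suffix are hypothetical.

```python
import re

# Assumed layout: "_d<digits>_b<digits>" at the very end of the string.
SUFFIX = re.compile(r"_d\d+_b\d+$")


def extract_suffix(text):
    """Return the trailing _d.._b.. part, or None if the text does not end with one."""
    match = SUFFIX.search(text)
    return match.group(0) if match else None


for name in ["aloha_maui_d0_b0", "new_york_d9_b10", "rio_de_janeiro_d3_b32"]:
    print(name, "->", extract_suffix(name))
```

Requiring digits after _d means a name such as rio_de_janeiro does not trigger a match on _de.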

RegEx for matching a datetime followed by spaces and any chars

I need to profile some data in a bucket, and have come across a bit of a dilemma.
This is the type of line in each file:
"2018-09-08 10:34:49 10.0 MiB path/of/a/directory"
What's required is to capture everything after the date and time (here, 10.0 MiB path/of/a/directory), keeping in mind that the separators are sometimes tabs and sometimes spaces.
To rephrase, I need everything from the moment the date and time end (excluding the tab or space preceding it).
I tried something like this:
p = re.compile(r'^[\d\d\d\d.\d\d.\d\d\s\d\d:\d\d:\d\d].*')
for line in lines:
    print(re.findall(line))
How do I solve this problem?
EDIT:
What if I also wanted to create new groups from the newly matched string? Say I wanted to recreate it as -->
10MiB engagementName/folder/file/something.xlsx engagementName extensionType something.xlsx
RE-EDIT:
The path/to/directory generally points to a file (and all files have extensions). Building on the reformatted string you have been helping me with, is there a way to extend the regex pattern so that it "creates" a new group by filtering on the fileExtensionType (I suppose by searching the end of the string for something along the lines of .anything) and adds that result to the formatted string?
Don't bother with a regular expression. You know the format of the line. Just split it:
from datetime import datetime
for l in lines:
    line_date, line_time, rest_of_line = l.split(maxsplit=2)
    print([line_date, line_time, rest_of_line])
# ['2018-09-08', '10:34:49', '10.0 MiB path/of/a/directory']
Take special note of the use of the maxsplit argument. This prevents it from splitting the size or the path. We can do this because we know the date has one space in the middle and one space after it.
If the size will always have one space in the middle and one space following it, we can increase it to 4 splits to separate the size, too:
for l in lines:
    line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
    print([line_date, line_time, size_quantity, size_units, line_path])
# ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/directory']
Note that extra contiguous spaces and spaces in the path don't screw it up:
l = "2018-09-08 10:34:49 10.0 MiB path/of/a/direct ory"
line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
print([line_date, line_time, size_quantity, size_units, line_path])
# ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/direct ory']
You can concatenate parts back together if needed:
line_size = size_quantity + ' ' + size_units
If you want the timestamp for something, you can parse it:
# 'T' could be anything, but 'T' is standard for the ISO 8601 format
timestamp = datetime.strptime(line_date + 'T' + line_time, '%Y-%m-%dT%H:%M:%S')
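Putting these pieces together, and addressing the follow-up about the top-level folder and file extension, one possible sketch is shown below. The function name parse_line, the dictionary keys, and the use of pathlib.PurePosixPath to pick the path apart are assumptions made for illustration; the sample line reuses the path from the question's edit.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_line(line):
    """Split one listing line into timestamp, size and path details."""
    # strip() first: with maxsplit, trailing whitespace would stay in the last field.
    line_date, line_time, size_quantity, size_units, line_path = line.strip().split(maxsplit=4)
    path = PurePosixPath(line_path)
    return {
        "timestamp": datetime.strptime(line_date + "T" + line_time, "%Y-%m-%dT%H:%M:%S"),
        "size": size_quantity + " " + size_units,
        "path": line_path,
        "top_level": path.parts[0],   # first path component
        "extension": path.suffix,     # includes the leading dot
        "file_name": path.name,
    }


# A tab after the time and spaces elsewhere, as described in the question.
sample = "2018-09-08 10:34:49\t10.0 MiB engagementName/folder/file/something.xlsx"
record = parse_line(sample)
for key, value in record.items():
    print(key, "=", value)
```

Returning a dictionary keeps the derived fields separate, so the reformatted output line can be assembled in whatever order is needed.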
You might not need a regular expression for this; a string split would suffice. However, if you do want one, you don't have to anchor your expression at the very beginning of the line. You can simply use this expression:
(:[0-9]+\s+)(.*)$
You can even modify it slightly to this expression, which may be a bit faster:
:([0-9]+\s+)(.*)$
Example Test:
# -*- coding: UTF-8 -*-
import re
string = "2018-09-08 10:34:49 10.0 MiB path/of/a/directory"
expression = r'(:[0-9]+\s+)(.*)$'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
Output
YAAAY! "10.0 MiB path/of/a/directory" is a match 💚
JavaScript Performance Benchmark
This snippet is a JavaScript performance test with 10 million times repetition of your input string:
repeat = 10000000;
start = Date.now();
for (var i = repeat; i >= 0; i--) {
    var string = "2018-09-08 10:34:49 10.0 MiB path/of/a/directory";
    var regex = /(.*)(:[0-9]+\s+)(.*)/g;
    var match = string.replace(regex, "$3");
}
end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
Edit:
You can capture just the end of the timestamp: because the expression has fewer boundaries, it becomes simpler and faster, and it still works on unexpected inputs such as:
2019/12/15 10:00:00 **desired output**
2019-12-15 10:00:00 **desired output**
2019-12-15, 10:00:00 **desired output**
2019-12 15 10:00:00 **desired output**
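The claim that the expression tolerates these variations can be checked with a short sketch. The word payload stands in for the desired output and is a placeholder, as is the tab used in the last sample.

```python
import re

# Same idea as the expression above: match the end of the time, then capture the rest.
expression = re.compile(r':[0-9]+\s+(.*)$')

samples = [
    "2019/12/15 10:00:00 payload",
    "2019-12-15 10:00:00 payload",
    "2019-12-15, 10:00:00 payload",
    "2019-12 15 10:00:00\tpayload",  # tab instead of a space
]

results = []
for sample in samples:
    match = expression.search(sample)
    results.append(match.group(1) if match else None)

print(results)
```

The first colon in the time is skipped because the digits after it are followed by another colon rather than whitespace, so the match starts at the seconds field.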

How does raw_input().strip().split() in Python work in this code?

Hopefully, the community might explain this better to me. Below is the objective, I am trying to make sense of this code given the objective.
Objective: Initialize your list and read in the value of n followed by n lines of commands, where each command will be of the types listed above. Iterate through each command in order and perform the corresponding operation on your list.
Sample input:
12
insert 0 5
insert 1 10
etc.
Sample output:
[5, 10]
etc.
The first line contains an integer, n, denoting the number of commands.
Each line of the subsequent lines contains one of the commands described above.
Code:
n = int(raw_input().strip())
List = []
for number in range(n):
    args = raw_input().strip().split(" ")
    if args[0] == "append":
        List.append(int(args[1]))
    elif args[0] == "insert":
        List.insert(int(args[1]), int(args[2]))
So this is my interpretation of the variable "args": you take the raw input from the user, then remove the white space from it. Once that is removed, the split function puts the string into a list.
If my raw input was "insert 0 5", wouldn't strip() turn it into "insert05"?
In Python, calling split(delimiter) on a string returns a list of the pieces between occurrences of the delimiter you specify (by default it splits on any run of whitespace), and strip() removes whitespace only at the beginning and end of a string.
So step by step the operations are:
raw_input() #' insert 0 5 '
raw_input().strip() #'insert 0 5'
raw_input().strip().split() #['insert', '0', '5']
You can use split(';'), for example, if you want to convert strings delimited by semicolons, such as 'insert;0;5'.
Let's take an example, you take raw input
string=' I am a coder '
The input arrives as a string. First, strip() is applied, i.e. string.strip() makes it
string='I am a coder'
since spaces at the front and end are removed.
Now, split() is used to split the stripped string into a list
i.e.
string=['I', 'am', 'a', 'coder']
Nope, that would be .replace(" ", ""). .strip() just gets rid of whitespace at the beginning and end of the string.
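For readers on Python 3, where raw_input() was renamed to input(), the same strip-then-split logic can be sketched as a function that takes the lines as a list, which makes it easy to try without typing input. The function name run_commands and the sample lines are hypothetical.

```python
def run_commands(lines):
    """Apply 'append x' / 'insert i x' commands to a fresh list."""
    result = []
    count = int(lines[0].strip())
    for raw in lines[1:count + 1]:
        # strip() trims only the ends; split() then breaks on the inner whitespace.
        args = raw.strip().split()
        if args[0] == "append":
            result.append(int(args[1]))
        elif args[0] == "insert":
            result.insert(int(args[1]), int(args[2]))
    return result


sample = ["2\n", "  insert 0 5 \n", "insert 1 10\n"]
print("  insert 0 5 \n".strip())   # inner spaces survive
print(run_commands(sample))
```

The stray spaces and newlines in the sample show why strip() comes before split(): the command words come out clean, and the spaces between them are still there to split on.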

Python: Remove First Character of each Word in String

I am trying to figure out how to remove the first character of each word in a string.
My program reads in a string.
Suppose the input is :
this is demo
My intention is to remove the first character of each word of the string, that is, t, i and d, leaving his s emo.
I have tried:
Using a for loop to traverse the string.
Checking for spaces in the string using the isspace() function.
Storing the index of the letter that comes after the space, i = char + 1, where char is the index of the space.
Then trying to remove the empty space using str_replaced = str[i:].
But it removed the entire string except the last word.
List comprehensions are your friend. This is the most basic version, in just one line:
s = "this is demo"
print(" ".join([x[1:] for x in s.split(" ")]))
output:
his s emo
In case the input string can have not only spaces, but also newlines or tabs, I'd use regex.
In [1]: inp = '''Suppose we have a
...: multiline input...'''
In [2]: import re
In [3]: print re.sub(r'(?<=\b)\w', '', inp)
uppose e ave
ultiline nput...
You can simply use a list comprehension:
str = 'this is demo'
mstr = ' '.join([s[1:] for s in str.split(' ')])
The mstr variable will then contain 'his s emo'.
This method is a bit long, but easy to understand. The flag variable records whether the previous character was a space. If it was, the current letter is skipped. The loop starts at index 1, which drops the first letter of the first word.
s = "alpha beta charlie"
t = ""
flag = 0
for x in range(1, len(s)):
    if flag == 0:
        t += s[x]
    else:
        flag = 0
    if s[x] == " ":
        flag = 1
print(t)
output
lpha eta harlie
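A variant of the comprehension approach that also copes with tabs, newlines, and repeated spaces is to call split() with no argument. This is a sketch only; the function name drop_first_letters is hypothetical, and the original spacing is not preserved because the words are re-joined with single spaces.

```python
def drop_first_letters(text):
    """Remove the first character of every whitespace-separated word."""
    # split() with no argument treats spaces, tabs and newlines alike.
    return " ".join(word[1:] for word in text.split())


print(drop_first_letters("this is demo"))
print(drop_first_letters("alpha  beta\tcharlie\n"))
```

One-letter words become empty strings, so an input containing them ends up with doubled spaces in the result.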
