The following split/index retrieves the output 'accountv2' from
Ancestor:{'ancestorPath': '/mnt/lake/RAW/Internal/origination/dbo/accountv2/1/Year=2023/Month=2/Day=2/Time=04-09', 'dfConfig': '{"sparkConfig":{"header":"true"}}', 'fileFormat': 'SQL'}
The split/index code is as follows:
Ancestor['ancestorPath'].split("/")[7]
Can someone help modify the split/index so that it strips off the last two characters, i.e. v2,
so the output will be account rather than accountv2?
Thanks
If this isn't something you need to do in an automated pipeline, I'd suggest using
string1 = "accountv2"
string2 = string1[:-2]
Which cuts off the last two characters.
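Combined with the split/index from the question, a minimal sketch of the whole thing (using the ancestorPath value given above; table_name is just an illustrative variable name):
Ancestor = {'ancestorPath': '/mnt/lake/RAW/Internal/origination/dbo/accountv2/1/Year=2023/Month=2/Day=2/Time=04-09'}

# split on '/', take index 7 ('accountv2'), then drop the trailing two characters
table_name = Ancestor['ancestorPath'].split("/")[7][:-2]
print(table_name)  # account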
Let's say I have a string defined like this:
string1 = '23h4b245hjrandomstring345jk3n45jkotherrandomstring'
The goal is to grab the 11 characters (for example '345jk3n45jk') that come after a given part of the string (for example 'randomstring'), using a specified search term and a specified number of characters to grab after that search term.
I tried doing something like this:
string2 = substring(string1,'randomstring', 11)
I appreciate any help you guys have to offer!
start = string1.find("randomstring") + len("randomstring")  # index just after the search term
string2 = string1[start:start + 11]
In one line, using split, and supposing that your randomstring is unique in your string, which seems to be the case from how you worded the question:
string1 = '23h4b245hjrandomstring345jk3n45jkotherrandomstring'
randomstring = 'randomstring'
nb_char_to_take = 11
# split on randomstring, take the part after it (the second element of the list), then the first 11 characters
result = string1.split(randomstring)[1][:nb_char_to_take]
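With the example string above, a quick check of the intermediate split and the final value:
print(string1.split(randomstring))  # ['23h4b245hj', '345jk3n45jkother', '']
print(result)                       # 345jk3n45jk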
You can use a simple regular expression like this
import re
s = "23h4b245hjrandomstring345jk3n45jkotherrandomstring"
result = re.findall("randomstring(.{11})", s)[0]
string1 = '23h4b245hjrandomstring345jk3n45jkotherrandomstring'
string2 = string1[10:22]
print(string2)
randomstring
You could use that. It's called string slicing: you count the positions of the characters, and the first number before the colon is your starting point, the second is your ending point; with those two position numbers you get whatever is in between, and the last (a third number) is for a different function, the step. I highly suggest you search string slicing on YouTube, as my explanation alone won't really help you, and also search up the find string method; those should help you get the idea behind these functions. Sorry I couldn't be of much help, hope the videos help.
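For reference, a tiny sketch of the slicing syntax described above (the example string is just 'randomstring' itself):
s = 'randomstring'
print(s[0:6])  # 'random' -> characters from index 0 up to, but not including, index 6
print(s[6:])   # 'string' -> from index 6 to the end
print(s[::2])  # 'rnosrn' -> the third number is the step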
I have the following two different cases of list of strings:
my_list1=['_','net_my_name','_64', '_66']
my_list2=['net_another_file']
I would like to extract
net_my_name as my name in case I have type of lists such as my_list1;
net_another_file as another file in case I have type of lists such as my_list2.
To do so, I was thinking of:
in case I find a situation like that one described by my_list1, then remove elements that are numerical, then split on _ to take the last two items (i.e. my name);
in case I find a situation like that one described by my_list2, then split on _ to take the last two items (i.e. another file).
If I remove numerical values where they occur, I am left with my_name as the last part, i.e. my name as the last two words.
Expected output:
my name
another file
Can you please tell me how to 'translate' the steps above into code? Thank you
Consider this code:
import re
string = "net_another_file777"
string = re.sub("[0-9]", "", string) # "net_another_file"
L = string.split('_')[-2:] # ['another', 'file']
Now you just have to go through the list and apply this to every element.
Hope this helps you.
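One way to apply this to the two example lists from the question, a small sketch (extract is just an illustrative helper name; it joins the fragments first so the stray '_' and numeric pieces fall away):
import re

def extract(parts):
    joined = "".join(parts)                       # e.g. '_net_my_name_64_66'
    cleaned = re.sub("[0-9]", "", joined)         # drop the numerical values
    words = [w for w in cleaned.split('_') if w]  # split on _ and discard empty pieces
    return " ".join(words[-2:])                   # keep the last two words

print(extract(['_', 'net_my_name', '_64', '_66']))  # my name
print(extract(['net_another_file']))                # another file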
In Python, how do you get the last and second-to-last elements in a string?
string "client_user_username_type_1234567"
expected output : "type_1234567"
Try this :
>>> s = "client_user_username_type_1234567"
>>> '_'.join(s.split('_')[-2:])
'type_1234567'
You can also use re.findall:
import re
s = "client_user_username_type_1234567"
result = re.findall(r'[a-zA-Z]+_\d+$', s)[0]
Output:
'type_1234567'
There's no single built-in function that will do this for you; you have to combine what Python gives you, and for that I present:
split, slice and join
"_".join("one_two_three".split("_")[-2:])
In steps:
Split the string by the common separator, "_"
s.split("_")
Slice the list so that you get the last two elements by using a negative index
s.split("_")[-2:]
Now you have a list containing the last two elements; join that list back together with the separator "_" so it reads like the original string.
"_".join("one_two_three".split("_")[-2:])
That's pretty much it. Another way to investigate is through regex.
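A small sketch of the regex route mentioned above (assuming the same underscore separator; the pattern simply grabs the last two "_"-separated fields):
import re

s = "client_user_username_type_1234567"
match = re.search(r"[^_]+_[^_]+$", s)  # last two fields, anchored to the end of the string
print(match.group(0))                  # type_1234567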
I am trying to split a string I extract on the first occurrence of a comma. I have tried using split, but something is wrong, as it doesn't split.
for i in range(len(items)):
    alldata = items[i].getText().encode('utf-8').split(',', 1)
    csvfile.writerow(alldata)
The variable items contains the data I extract from a URL. The output in the CSV file is put in one column. I want it to be in two columns. An example of the data (alldata) I get in the CSV file looks like this:
['\n\n\n1958\n\n\nGeorge Lees\n']
Using this data as an example, I need the year 1958 to be on one column, and the name George Lees to be on another column instead of the new lines.
EDIT
Forgot to mention what I meant with the commas. The reason why I mentioned the commas is that I also tried splitting on whitespaces. When I did that I got the data:
['1958', 'George', 'Lees']
So what I tried to achieve was to split the data on the first comma occurrence; that's why I did split(',', 1), forgetting that I also need to split on whitespace. My problem is that I don't know how to split so that the year is in one column and the whole name is in another column. I got
['\n\n\n1958\n\n\nGeorge Lees\n']
When I tried to split with split(',', 1)
You can use strip to remove the whitespace at the start and end, and then split on "\n" to get the required output. I have also used filter to remove any empty strings.
Ex:
A = ['\n\n\n1958\n\n\nGeorge Lees\n']
print filter(None, A[0].strip().split("\n"))
Output:
['1958', 'George Lees']
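To get that into two CSV columns, a minimal Python 3-style sketch (out.csv and the writer setup are hypothetical; the question's csvfile writer would take their place):
import csv

A = ['\n\n\n1958\n\n\nGeorge Lees\n']

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    row = list(filter(None, A[0].strip().split("\n")))  # ['1958', 'George Lees']
    writer.writerow(row)  # year in the first column, full name in the second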
Super NOOB to Python (2.4.3): I am executing a function containing a regular expression which searches through a txt file that I'm importing. I am able to read and run re.search on the text file and the output is correct. I need to run this for multiple occurrences (the regex occurs 48 times in the text). The code is as follows:
#!/usr/bin/python
import re
dataRead = open('pd_usage_14-04-23.txt', 'r')
dataWrite = open('test_write.txt', 'w')
text = (dataRead.read()) #reads and initializes text for conversion to string
s = str(text) #converts text to string for reading
def user(str):
    re1='((?:[a-z][a-z]+))' # Word 1
    re2='(\\s+)' # White Space 1
    re3='((?:[a-z][a-z]+))' # Word 2
    re4='(\\s+)' # White Space 2
    re5='((?:[a-z][a-z]*[0-9]+[a-z0-9]*))' # Alphanum 1
    rg = re.compile(re1+re2+re3+re4+re5,re.IGNORECASE|re.DOTALL)
    #alphanum1=rg.group(5)
    re.findall(rg, s, flags=0)
    #print "("+alphanum1+")"+"\n"
    #if m:
    #word1=m.group(1)
    #ws1=m.group(2)
    #word2=m.group(3)
    #ws2=m.group(4)
    #alphanum1=m.group(5)
    #print "("+alphanum1+")"+"\n"
    return
user(s)
dataRead.close()
dataWrite.close()
OUTPUT: g706454
THIS OUTPUT IS CORRECT! BUT...!
I need to run it multiple times, reading text that's further down.
I have 2 other definitions that need to be run multiple times also. I need all 3 to run consecutively, and then run again but starting with the next line or something, to search and output newer data. All the logic I tried to implement returns the same output.
So I have something like this:
for count in range(0, 47):
    if stop_read:
        date(s)
        usage(s)
        user(s)
stop_read is a definition that finds the next line after the data that I'm looking for (date, usage, user). I figured I could call this to say: if you hit stop_read, read the next line and run the definitions all over again.
Any help is greatly appreciated!
Here is what I do for a regex in Python 3, which should be similar in Python 2. This is for a multiline search.
regex = re.compile("\\w+-\\d+\\b", re.MULTILINE)
Then later on in code I have something like:
myset.update([m.group(0) for m in regex.finditer(logmsg.text)])
Maybe you might want to update your Python if you can, 2.4 is old, old, and stale.
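A self-contained sketch of that pattern (the sample text and myset are made up here, since the original logmsg object isn't shown):
import re

regex = re.compile("\\w+-\\d+\\b", re.MULTILINE)

text = "ticket ABC-123 closed\nticket XYZ-789 open"  # hypothetical sample text
myset = set()
myset.update([m.group(0) for m in regex.finditer(text)])
print(myset)  # {'ABC-123', 'XYZ-789'} (set order may vary)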
Looks like re.findall would solve your problem:
re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
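Applied to the question, a rough sketch (assuming the same pd_usage_14-04-23.txt file and the asker's re1..re5 pattern; written in Python 3 style):
import re

pattern = re.compile(
    '((?:[a-z][a-z]+))(\\s+)((?:[a-z][a-z]+))(\\s+)((?:[a-z][a-z]*[0-9]+[a-z0-9]*))',
    re.IGNORECASE | re.DOTALL)

with open('pd_usage_14-04-23.txt', 'r') as dataRead:
    s = dataRead.read()

# findall returns one tuple of groups per match, so all 48 occurrences come back at once
for word1, ws1, word2, ws2, alphanum1 in pattern.findall(s):
    print(alphanum1)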