Python 2.7: How to split on first occurrence?

I am trying to split a string I extract on the first occurrence of a comma. I have tried using split, but something is wrong: it doesn't split.
for i in range(len(items)):
    alldata = items[i].getText().encode('utf-8').split(',', 1)
    csvfile.writerow(alldata)
The variable items contains the data I extract from a URL. The output in the CSV file is put in one column; I want it in two columns. An example of the data (alldata) I get in the CSV file looks like this:
['\n\n\n1958\n\n\nGeorge Lees\n']
Using this data as an example, I need the year 1958 to be on one column, and the name George Lees to be on another column instead of the new lines.
EDIT
Forgot to mention what I meant with the commas. The reason why I mentioned the commas is that I also tried splitting on whitespaces. When I did that I got the data:
['1958', 'George', 'Lees']
So what I tried to achieve was to split the data on the first comma occurrence. That's why I did split(',', 1), forgetting that I also need to split on whitespace. So my problem is that I don't know how to split on both the first comma occurrence and the whitespace, so that the year is in one column and the whole name is in another column. I got
['\n\n\n1958\n\n\nGeorge Lees\n']
When I tried to split with split(',', 1)

You can use strip to remove the whitespace at the start and end, and then split on "\n" to get the required output. I have also used filter to remove any empty strings.
Ex:
A = ['\n\n\n1958\n\n\nGeorge Lees\n']
print filter(None, A[0].strip().split("\n"))
Output:
['1958', 'George Lees']
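Putting that back into the original loop, a sketch might look like this (Python 3 syntax; the items list and the in-memory buffer are stand-ins for the scraped data and the real CSV file):

```python
import csv
import io

# Hypothetical stand-in for the scraped items from the question
items = ['\n\n\n1958\n\n\nGeorge Lees\n']

buf = io.StringIO()  # in-memory stand-in for the real CSV file
csvfile = csv.writer(buf)
for text in items:
    # strip outer whitespace, split on newlines, drop empty fields
    fields = [part for part in text.strip().split('\n') if part]
    csvfile.writerow(fields)

print(buf.getvalue())  # 1958,George Lees
```

Each row then lands in two columns: the year and the full name.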

Related

how to split a string having more than one separator in python?

I have a pyspark dataframe that contains a column which has strings as shown below
qwe1
tre1.eyyu
cvbn.poiu.sdfg
- A value could be a single string (qwe1)
- A value could have one delimiter, i.e. ".", with characters on both sides of it (tre1.eyyu)
- A value could have two delimiters. (cvbn.poiu.sdfg)
code as below
p1 = "<path_to_parquet file>"
df_ref_parquet = spark.read.option('header', True).parquet(p1)
table = [x["FILList"] for x in df_ref_parquet.rdd.collect()]
fil_cd_left = []
for row in table:
    row.split(".")
    fil_cd_left.append(row[0:4])
print(fil_cd_left)
I want to create 3 lists out of them.
- Hence I have written a script that iterates over the data frame, splits each value on "." and creates a first list that has all values in the default format shown above.
- I have then applied Python slicing to get the four characters to the left of the delimiter "." and appended them to another list.
However, I am not able to create the other two lists, which would hold the characters to the right of the delimiter and the characters between the two delimiters.
Please let me know if I was not able to explain this properly; I will try to re-phrase.
Note: I have searched in Stackoverflow for other articles, but they don't seem to relate to my scenario.
If you always want a three-element list, pre-define it:
for row in table:
    out = ['', '', '']
    for index, word in enumerate(row.split('.', maxsplit=2)):
        out[index] = word
    print(out)
Output:
['qwe1', '', '']
['tre1', 'eyyu', '']
['cvbn', 'poiu', 'sdfg']
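To get the three separate lists the question asks for, one sketch (plain Python over the collected values; the rows list here stands in for the column pulled out of the dataframe) is to pad each split result to three fields and append each field to its own list:

```python
rows = ["qwe1", "tre1.eyyu", "cvbn.poiu.sdfg"]  # stand-in for the collected column

left, middle, right = [], [], []
for row in rows:
    parts = row.split('.', 2)          # split on at most two "." delimiters
    parts += [''] * (3 - len(parts))   # pad so every row yields three fields
    left.append(parts[0])
    middle.append(parts[1])
    right.append(parts[2])

print(left)    # ['qwe1', 'tre1', 'cvbn']
print(middle)  # ['', 'eyyu', 'poiu']
print(right)   # ['', '', 'sdfg']
```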

Why does "\n" appear in my string output?

I have elements that I've scraped off of a website and when I print them using the following code, they show up neatly as spaced out elements.
print("\n" + time_element)
prints like this
F
4pm-5:50pm
but when I pass time_element into a dataframe as a column and convert it to a string, the output looks like this
# b' \n F\n \n 4pm-5:50pm\n
I am having trouble understanding why it appears this way and how to get rid of the "\n" characters. I tried using regex to match the "F" and the "4pm-5:50pm", thinking that way I could separate out the data I need. But using various methods, including
# Define the list and the regex pattern to match
time = df['Time']
pattern = '[A-Z]+'
# Filter out all elements that match the pattern
filtered = [x for x in time if re.match(pattern, x)]
print(filtered)
I get back an empty list.
From my research, I understand that "\n" represents a new line and that there might be invisible characters. However, I don't understand enough about how they behave to get rid of them (or work around them) and extract the data I need.
When I pass the data to csv format, it prints like this all in one cell
F
4pm-5:50pm
but I still end up in the similar place when it comes to separating out the data that I need.
You can use the strip() function when you extract data from the website to avoid the "\n".
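On the regex part of the question: re.match only matches at the very start of the string, and the scraped value starts with whitespace, which is why the filtered list came back empty. A small sketch (the time_element value below is made up to mirror the question's output):

```python
import re

time_element = ' \n F\n \n 4pm-5:50pm\n '  # made-up value mirroring the question

cleaned = time_element.strip()
# re.match anchors at position 0, so the leading whitespace defeated the
# original pattern; re.search scans the whole string instead
day = re.search(r'[A-Z]+', cleaned).group()
times = re.search(r'\d{1,2}[ap]m-\d{1,2}:\d{2}[ap]m', cleaned).group()
print(day, times)  # F 4pm-5:50pm
```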

Regex to split phrases separated into columns by many whitespaces

I'm hoping to get some regex assistance. I've got lines of columnar text that I'd like to split with regexes. Each column can be phrases of arbitrary characters, separated by a whitespace or maybe even two. Columns are separated by a larger number of whitespaces, perhaps at least 4.
Ultimately, I need to match a date if it's in the second column.
Here's an example. I need the date in this column to be the group important_date
Rain in Spain      11/01/2000     90 Days
important_date should not match the date in this next line:
Another line of text 10/15/1990
# EXAMPLE:
import re
regex = r"(.*)\s(?P<important_date>\d{1,2}\/\d{1,2}\/\d{4}).*"
match_this = "    Rain in Spain      11/01/2000     90 Days"
not_this = "    Another line of text 10/15/1990"
print(f"Finding this date is good:{re.search(regex, match_this).group('important_date')}" )
print(f"But this one should throw an error:{re.search(regex,not_this).group('important_date')}")
I'm also comparing these regexes against lots of other lines of text with various structures, so this is why I don't want to just split on a run of spaces. To know I've got the important_date, I need to know that the whole line looks like: one column, second column is a date, maybe another column after the date too.
Doing this with a single regex would also just fit much more easily into the rest of the application. I'm worried that line.split(" ") and checking the resulting list would interfere with other checks going on in this app.
I have not been able to figure out how to write the first part of the regex that captures words with no-more-than-2 spaces between them. Can I use lookaheads for this somehow?
Thank you!
Try this: (?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$ (https://regex101.com/r/PnIU3e/3).
I assume that the first column consists of words separated by single spaces, and is separated from the second column by more than one space.
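Checking the suggested pattern against the question's two examples (strings reproduced here with multi-space column gaps):

```python
import re

pattern = r"(?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$"

match_this = "    Rain in Spain      11/01/2000     90 Days"
not_this = "    Another line of text 10/15/1990"

# Date in the second column: matched
print(re.search(pattern, match_this).group('important_date'))  # 11/01/2000
# Date preceded by only a single space: not matched
print(re.search(pattern, not_this))  # None
```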
You can split on fields of 2 or more spaces and only use the data if it is the second field (guarding against lines that don't have enough fields):
for x in (match_this, not_this):
    te = re.split(r'[ \t]{2,}', x)  # te[0] is '' when the line starts with spaces
    if len(te) > 2 and re.match(r'\d{1,2}\/\d{1,2}\/\d{4}', te[2]):
        # you have an important date
        print(te[2])
    else:
        # you don't
        print('no match')

How to replace these values in lines of text

I have several rows of text. The first row is a header row, and each subsequent line represents the fields of data, each value is separated with a comma. Within each line are one to three dollar values, ranging from single digit dollar values ($4.50) to triple digit ($100,000.34). They are also surrounded by quotes.
206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683
I need to eliminate the quotations and dollar sign for the money values, as well as the comma inside. The period separator for the decimal value needs to stay, so "$6,801.56" becomes 6801.56
I've used regex to eliminate the dollar sign as well as quotations--
with open("datafile.csv", "r") as file:
    data = file.readlines()
for i in data:
    i = re.sub('[$"]', '', i)
which then makes the data look like 7545245,6,801.56,3545647
so if I split by a comma, it cuts larger values in two.
['206360941,5465685679,4,073.77,567845676547,88,457.21,34589309683']
I thought about splitting by quotations, doing some more regex and rejoining with .join() but it turns out that only the currency values with a comma contain quotations, the smaller values with no comma do not.
Also, I know I can use re.findall(r'\$\d{1,3}\,\d\d\d\.\d\d', i) to draw out the number format, if I print it, it will output a list like [$100,351.35]
I am just not sure what to do with it after that.
This seems to work:
>>> data = '206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683'
>>> re.findall(r'"\$((\d+),)*(\d+)(\.\d+)"', data)
[('4,', '4', '073', '.77'), ('88,', '88', '457', '.21')]
>>> re.sub(r'"\$((\d+),)*(\d+)(\.\d+)"', r'\2\3\4', data)
'206360941,5465685679,4073.77,567845676547,88457.21,34589309683'
The idea is to capture the digits before and after the decimal point, keeping the decimal part as well. Since the second group repeats the contents of the first one without the comma, you just replace the whole quoted value with the contents of all groups except the first. That's why you need the ((\d+),)* group: it captures a subgroup plus the trailing comma, and you replace the whole group with just the subgroup. If there can be more than one comma (values of $1,000,000.00 or more), you'll probably need a more dynamic approach.
I'd recommend using csv.reader (or csv.DictReader if you want to do other processing on each column) to read the file as this will parse each column automatically. Once you read the file, you can do your regex on each column so no need to split the line yourself. The default delimiter and quotechar for csv.reader is as you would need, I believe.
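A sketch of that csv.reader approach (reading from an in-memory string here instead of datafile.csv):

```python
import csv
import io
import re

line = '206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683'

# csv.reader understands the quoting, so each quoted dollar value arrives
# as a single field despite the embedded comma
row = next(csv.reader(io.StringIO(line)))
cleaned = [re.sub(r'[$,]', '', col) for col in row]
print(cleaned)
# ['206360941', '5465685679', '4073.77', '567845676547', '88457.21', '34589309683']
```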
Did you try the module locale? As in How do I use Python to convert a string to a number if it has commas in it as thousands separators?
It'll be easier than regex.
First of all you could go about deleting all commas that are inside of quotes.
Pseudo code might look like:
s = your string
insideQuotes = false
for charIndex from 0 to length(s) - 1:
    c = s[charIndex]
    if c == '"':
        insideQuotes = !insideQuotes
    else if insideQuotes and c == ',':
        remove the character at charIndex from s
        charIndex--
Now that there are no more commas inside the quotes, you only need to remove the dollar signs and the quotes themselves!
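In Python, that pseudo code might translate to something like this (the function name is my own):

```python
def remove_commas_inside_quotes(s):
    # Walk the string once, tracking whether we are inside a quoted section
    out = []
    inside_quotes = False
    for c in s:
        if c == '"':
            inside_quotes = not inside_quotes
        if inside_quotes and c == ',':
            continue  # drop commas that appear inside quotes
        out.append(c)
    return ''.join(out)

line = '206360941,"$4,073.77",34589309683'
print(remove_commas_inside_quotes(line))  # 206360941,"$4073.77",34589309683
```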
Hope it helps!

Getting wrong data with regex

I'm facing an issue here. Python version 3.7.
https://regex101.com/r/WVxEKM/3
As you can see on the regex site, my regex is working great. However, when I try to read the strings with Python, I only get the first part, meaning no values after the comma.
Here's my code:
part_number = str(row)
partn = re.search(r"([a-zA-Z0-9 ,-]+)", part_number)
print(partn.group(0))
This is what partn.group(0) is printing:
FMC2H-OHC-100018-00
I need to get the string as regex, with comma and value:
FMC2H-OHC-100018-00, 2
Is my regex wrong? What is happening with the commas and values?
ROW Values
Here are the row values converted to strings; the data retrieved from my db also includes parentheses and quotes:
('FMC2H-OHC-100018-00', 2)
('FMC2H-OHC-100027-00', 0)
I don't think you need to convert the row values to a string and then try to parse the result with a regex. The clue is where you said in your update that "Here are the row values converted to string", implying that they're in some other format initially; the result looks like they're actually tuples of two values, a string and an integer.
If that's correct, then you can avoid converting them to strings and parsing with a regex, because you can get the string you want simply by using Python's relatively simple built-in string formatting.
Here's what I mean:
# Raw row data retrieved from database.
rows = [('FMC2H-OHC-100018-00', 2),
        ('FMC2H-OHC-100027-00', 0),
        ('FMC2H-OHC-100033-00', 0),
        ('FMC2H-OHC-100032-00', 20),
        ('FMC2H-OHC-100017-00', 16)]

for row in rows:
    result = '{}, {}'.format(*row)  # Convert data in row to a formatted string.
    print(result)
Output:
FMC2H-OHC-100018-00, 2
FMC2H-OHC-100027-00, 0
FMC2H-OHC-100033-00, 0
FMC2H-OHC-100032-00, 20
FMC2H-OHC-100017-00, 16
Your problem is that you didn't include the ' in your character group. So this regex matches, for example, FMC2H-OHC-100018-00 and , 2, but not both together. Also, re.search stops searching after it finds the first match. So if you only want the first match, go with:
re.search(r"([\w ',-]+)", part_number)
where I changed A-Za-z0-9 to \w because it's shorter and more readable. If you want a list of all matches, go with:
re.findall(r"([\w ',-]+)", part_number)
