python regex to group numbers - python

I'm trying to capture numbers from a string like this:
"30098.904999 5 ABC Da d 8 06 01 20 00 80 11 C0 04"
First I collapse all runs of whitespace into single spaces:
test = ' '.join(test.split())
Then I try to apply this pattern:
pattern = r"(\d+.\d+) (\d+) ABC Da d 8 (\d\d) (\d\d) (\d\d) (\d\d) (\d\d) (\d\d) (\d\d) (\d\d)"
However, I still get no match:
result = re.search(pattern, test)
print ("result: " + str(result.groups(0)))
print ("result: " + str(result.groups(0)))
AttributeError: 'NoneType' object has no attribute 'groups'
If I change the first number to 50.309951, then it works.
The first number is a timestamp, and the number of digits in it can vary.
Any help is highly welcome! :)
Thanks in advance,
j.

Why not just split the string on whitespace?
items = test.split()
You will receive a list of items:
['30098.904999', '5', 'ABC', 'Da', 'd', '8', '06', '01', '20', '00', '80', '11', 'C0', '04']

That's because of C0, which is not matched by \d\d. You can use \w\d for that part. As a more general approach, you can use re.findall() to capture all the numbers:
In [24]: test = "30098.904999 5 ABC Da d 8 06 01 20 00 80 11 C0 04"
In [27]: re.findall(r'\d+(?:\.\d+)?', test)
Out[27]: ['30098.904999', '5', '8', '06', '01', '20', '00', '80', '11', '0', '04']
# If you want C0 too:
In [28]: re.findall(r'\w?\d+(?:\.\d+)?', test)
Out[28]: ['30098.904999', '5', '8', '06', '01', '20', '00', '80', '11', 'C0', '04']

You don't need to use split, as you can use \s+ to match one or more whitespace characters.
Your regex also needs correction: the dot must be escaped as \. and the C0 field is not matched by \d\d.
You can use this:
(\d+\.\d+)\s+(\d+)\s+ABC\s+Da\s+d\s+8\s+(\d{2})\s+(\d{2})\s+(\d{2})\s+(\d{2})\s+(\d{2})\s+(\d{2})\s+([A-Z]\d)\s+(\d{2})
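A minimal sketch of how the pattern above could be applied in Python, assuming the input sits in a variable named test as in the question. The sample string keeps some extra spaces on purpose, and the check on the match object is one way to avoid the AttributeError when nothing matches.

```python
import re

# Sample line from the question, with extra spaces left in on purpose
test = "30098.904999   5 ABC Da d 8 06  01 20 00 80 11 C0 04"

# Same pattern as in the answer, split over several lines for readability
pattern = re.compile(
    r"(\d+\.\d+)\s+(\d+)\s+ABC\s+Da\s+d\s+8\s+"
    r"(\d{2})\s+(\d{2})\s+(\d{2})\s+(\d{2})\s+(\d{2})\s+(\d{2})\s+"
    r"([A-Z]\d)\s+(\d{2})"
)

match = pattern.search(test)
# re.search returns None when there is no match, so check before calling groups()
groups = match.groups() if match else ()
print("result:", groups)
```

If other fields can also contain letters, the \d{2} groups would need to be loosened in the same way as the C0 group.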

Related

Pandas, Python - Assembling a Data Frame with multiple lists from loop

I am using a loop to collect target data from a JSON file into lists. Each list is already a complete, ordered column, so no manipulation or reorganization is required; the lists only need to be attached side by side.
#Selecting Data into List
i=1
target = f'{pathway}\calls_{i}.json'
with open(target,'r') as f: #Reading JSON file
    data = json.load(f)
specsA=('PreviousDraws',['DrawNumber'])
draw=(glom(data,specsA)) #list type; glom is a package to access nested data in JSON file.
print(draw)
for j in range(0,5):
    specsB=('PreviousDraws',['WinningNumbers'],[f'{j}'],['Number'])
    number=(glom(data,specsB)) #list type; glom is a package to access nested data in JSON file.
    print(number)
#Now assembling lists into a table using pandas
The resulting lists from the code above are as follows:
#This is from variable draw
[10346, 10345, 10344, 10343, 10342, 10341, 10340, 10339, 10338, 10337, 10336, 10335, 10334, 10333, 10332, 10331, 10330, 10329, 10328, 10327]
#This is from variable number
['22', '9', '4', '1', '1', '14', '5', '3', '2', '8', '2', '1', '4', '9', '4', '4', '3', '13', '7', '14']
['28', '18', '16', '2', '3', '17', '16', '13', '11', '9', '8', '2', '9', '19', '7', '13', '7', '23', '21', '17']
['33', '24', '21', '4', '9', '20', '27', '19', '23', '19', '19', '7', '19', '30', '19', '27', '19', '32', '26', '21']
['35', '30', '28', '11', '21', '23', '33', '26', '35', '37', '27', '12', '20', '31', '22', '34', '22', '36', '27', '25']
['36', '32', '33', '19', '29', '38', '35', '27', '37', '38', '32', '30', '22', '36', '33', '39', '36', '38', '30', '27']
Expected Data Frame table after assembly:
Draw | Number[0] | Number[1] | Number[2] ...
10346 | 22 | 28 |
10345 | 9 | 18 |
10344 | 4 | 16 |
10343 | 1 | 2 |
10342 | 1 | 3 |
My attempt at assembling the table, organized as a dictionary of Series:
dct = {'DrawNumbers':pd.Series(draw),
'Index1':pd.Series(number),
'Index2':pd.Series(number),
'Index3':pd.Series(number),
'Index4':pd.Series(number),
'Index5':pd.Series(number)
}
df = pd.DataFrame(dct)
print(df)
Actual result: incorrect, because the last list's values are repeated across every column of each row. Only the Index5 column is correct; all the other Index columns show Index5's values.
DrawNumbers Index1 Index2 Index3 Index4 Index5
0 10346 36 36 36 36 36
1 10345 32 32 32 32 32
2 10344 33 33 33 33 33
3 10343 19 19 19 19 19
4 10342 29 29 29 29 29
5 10341 38 38 38 38 38
6 10340 35 35 35 35 35
7 10339 27 27 27 27 27
8 10338 37 37 37 37 37
9 10337 38 38 38 38 38
... ... ... ... ... ... ...
I also tried to change the data type of the numbers from string to int, but kept getting errors when attempting that. Either way, I am stuck and would appreciate some assistance.
The problem is that you are overwriting the number variable in the loop, so after the loop only the last list is left. Here is a solution that adds each Index column to the dataframe inside the loop.
# create an empty dataframe
df = pd.DataFrame()
#Selecting Data into List
i=1
target = f'{pathway}\calls_{i}.json'
with open(target,'r') as f: #Reading JSON file
    data = json.load(f)
specsA=('PreviousDraws',['DrawNumber'])
draw=(glom(data,specsA)) #list type; glom is a package to access nested data in JSON file.
print(draw)
# insert the draw to the dataframe
df['DrawNumbers'] = draw
for j in range(0,5):
    specsB=('PreviousDraws',['WinningNumbers'],[f'{j}'],['Number'])
    number=(glom(data,specsB)) #list type; glom is a package to access nested data in JSON file.
    print(number)
    # insert each number to the dataframe
    df[f'Index{j}'] = number
Assuming that number is a nested list:
number = list(map(list, zip(*number))) # this transposes the nested list so that each list within the list now corresponds to one row of the desired df
pd.DataFrame(data=number, index=draw)
This will output the df in the desired format. Of course you can go ahead and label the columns as you like, etc.
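The transposition idea can be tried without the JSON file or glom. The sketch below is plain Python and uses the first three draws from the question. The names numbers, rows and columns are illustrative, and it assumes every list from the loop was collected into numbers instead of overwriting one variable.

```python
# First three draws from the question; `numbers` holds one list per loop
# iteration (an assumption: the lists were appended, not overwritten).
draw = [10346, 10345, 10344]
numbers = [
    ["22", "9", "4"],
    ["28", "18", "16"],
    ["33", "24", "21"],
    ["35", "30", "28"],
    ["36", "32", "33"],
]

# zip(*numbers) transposes the column lists into per-draw rows;
# int() converts the string values to integers along the way.
rows = [
    (d, *(int(n) for n in nums))
    for d, nums in zip(draw, zip(*numbers))
]
columns = ["DrawNumbers"] + [f"Index{j}" for j in range(len(numbers))]

for row in rows:
    print(dict(zip(columns, row)))
```

The resulting rows and columns could then be handed to pd.DataFrame(rows, columns=columns) to get the table in the expected layout.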

numbers string of list separated with space in Python

I have a txt file that I read line by line, and it contains data like the sample below. I am trying to extract the numbers so that I can process them as an array or something like that. The gap between the date and the start of the numbers is a tab, and the numbers are separated by normal spaces.
04/10/2000 02 75 30 54 86
04/16/2000 63 63 48 32 12
04/15/2000 36 47 68 09 40
04/14/2000 06 27 31 36 43
04/13/2000 03 08 41 60 87
Here is the code I wrote:
for line in handle:
    line = line.rstrip()
    numberlist.append(line.split('\t'))
numbers=list()
numberlist=list()
for x in numberlist:
    numbers.append(x[1])
print(numbers[:2]) # just to see sample output
Could someone help me figure this out? I even checked some Python tutorials, but the more I think about the issue, the trickier and more detailed it seems to me. (Please note that this is not homework.)
Thanks
You can split on a tab \t to get 2 parts. Then split the second part on a space to get all the numbers.
strings = [
    "04/10/2000\t02 75 30 54 86",
    "04/16/2000\t63 63 48 32 12",
    "04/15/2000\t36 47 68 09 40",
    "04/14/2000\t06 27 31 36 43",
    "04/13/2000\t03 08 41 60 87"
]
numbers = list()
numberlist = list()
for s in strings:
    # split the date from the numbers on the tab
    lst = s.rstrip().split("\t")
    if len(lst) > 1:
        # split the second part on spaces to get the individual numbers
        lst = lst[1].split(" ")
        numberlist.append(lst)
        for n in lst:
            numbers.append(n)
print(numbers)
print(numberlist)
Output
['02', '75', '30', '54', '86', '63', '63', '48', '32', '12', '36', '47', '68', '09', '40', '06', '27', '31', '36', '43', '03', '08', '41', '60', '87']
[['02', '75', '30', '54', '86'], ['63', '63', '48', '32', '12'], ['36', '47', '68', '09', '40'], ['06', '27', '31', '36', '43'], ['03', '08', '41', '60', '87']]
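If it is not certain whether the separator is a tab or a space, a variation is to split on any whitespace and treat the first field as the date. This is only a sketch under that assumption. The names lines and rows are illustrative, and the numbers are converted to int for further processing.

```python
# Two sample lines from the question, with a tab after the date
lines = [
    "04/10/2000\t02 75 30 54 86",
    "04/16/2000\t63 63 48 32 12",
]

rows = []
for line in lines:
    parts = line.split()  # no argument: splits on tabs and spaces alike
    if len(parts) < 2:
        continue  # skip blank or malformed lines
    date, *nums = parts  # first field is the date, the rest are numbers
    rows.append((date, [int(n) for n in nums]))

print(rows)
```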

Is it possible to check a string comparing two regex then adding it to a dictionary?

Question
How can I run through the string so that, when the locationRegex condition is met, its output is added to a dictionary, any subsequent numbers matched by numberRegex are added to the same entry, and a new entry is created when the next location appears? See the desired output below.
Code
import re
# Text to check
text = "Italy Roma 20 40 10 4902520 10290" \
"Italy Milan 20 10 49 20 1030" \
"Germany Berlin 20 10 10 10 29 490" \
"Germany Frankfurt 20 0 0 0 0" \
"Luxemburg Luxemburg 20 10 49"
# regex to find location
locationRegex = re.compile(r'[A-Z]\w+\s[A-Z]\w+')
# regex to find numbers
numberRegex = re.compile(r'[0-9]+')
# Desired output
locations = {'Italy Roma': {'numbers': [10, 40, 10, 4902520]},
'Italy Milan': {'numbers': [20, 10, 49, 20, 1030]}}
What I have tried
I have run the regexes against the string with re.findall, but I cannot assign the numbers to the locations, because they end up in two separate pots of locations and numbers.
Use a single regex to split the text into chunks, use groups within the regex to separate the data (note the parentheses), and finally use split to split the number string on spaces:
import re
text = (
"Italy Roma 20 40 10 4902520 10290"
"Italy Milan 20 10 49 20 1030"
"Germany Berlin 20 10 10 10 29 490"
"Germany Frankfurt 20 0 0 0 0"
"Luxemburg Luxemburg 20 10 49"
)
line_regex = re.compile(r"([A-Z]\w+\s[A-Z]\w+) ([0-9 ]+)")
loc_dict = {}
for match in re.finditer(line_regex, text):
    print(match.group(1))
    print(match.group(2))
    loc_dict[match.group(1)] = {"numbers": match.group(2).split(" ")}
print(loc_dict)
The dict will be:
{'Italy Roma': {'numbers': ['20', '40', '10', '4902520', '10290']},
'Italy Milan': {'numbers': ['20', '10', '49', '20', '1030']},
'Germany Berlin': {'numbers': ['20', '10', '10', '10', '29', '490']},
'Germany Frankfurt': {'numbers': ['20', '0', '0', '0', '0']},
'Luxemburg Luxemburg': {'numbers': ['20', '10', '49']}}
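The values above are strings, while the desired output in the question uses integers. A small sketch of the same approach with an int conversion, using only the first two locations from the question's text:

```python
import re

# First two locations from the question, concatenated without separators
text = (
    "Italy Roma 20 40 10 4902520 10290"
    "Italy Milan 20 10 49 20 1030"
)

line_regex = re.compile(r"([A-Z]\w+\s[A-Z]\w+) ([0-9 ]+)")

loc_dict = {}
for match in line_regex.finditer(text):
    # split() without an argument also tolerates repeated or trailing spaces
    loc_dict[match.group(1)] = {"numbers": [int(n) for n in match.group(2).split()]}

print(loc_dict)
```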
Note that you should check for edge cases: no numbers, cities with a space in the name and so on.
Cheers!

How do I remove the quotes using my code?

Basically, how do I add csv.QUOTE_NONE to my code? I am using Python 3.7. I am trying to keep it as simple as possible.
csvFile = csv.reader(open("cats.csv",'r'))
header = next(csvFile)
index = 1
print (header)
print("")
for row in csvFile :
    if row[1] >= "35" :
        print (row)
        index +=1
This is what I got:
1 ['Oliver', '12', 'HOPE']
2 ['Leo', '16', 'SPCA', '']
3 ['Milo', '13', 'ISPCA', '']
4 ['Jack', '12', 'SPCA', '']
5 ['George', '14', 'HOPE']
6 ['Bella', '10', 'FFF', '']
7 ['Cleo', '14', 'SPCA', '']
8 ['Nala', '16', 'ISPCA', '']
9 ['Teddy', '12', 'SPCA', '']
10 ['Zeus', '16', 'SPCA', '']
11 ['Louie', '14', 'LOVS', '']
12 ['Apollo', '11', 'FFF', '']
13 ['George', '10', 'SPCA', '']
14 ['Ziggy', '11', 'ISPCA', '']
Expected results:
1 ['Oliver,12,HOPE']
2 ['Leo,16,SPCA']
3 ['Charlie,18,SPCA']
4 ['Milo,13,ISPCA']
5 ['Jack,12,SPCA']
6 ['George,14']
7 ['Simon,22,SPCA']
8 ['Loki,24,SPCA']
9 ['Simba,23,SPCA']
Use join:
lst = ['Oliver', '12', 'HOPE']
print([','.join(lst)])
So, in your example I guess this would be:
print([','.join(row)])
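Put together, the loop could look like the sketch below. It reads from an in-memory sample instead of cats.csv so that it runs on its own. The header names and the two data rows are assumptions for illustration, and dropping empty trailing fields is an extra step that the answer above does not mention.

```python
import csv
import io

# In-memory stand-in for cats.csv (hypothetical header, two sample rows)
sample = io.StringIO("Name,Age,Shelter\nOliver,12,HOPE\nLeo,16,SPCA,\n")

reader = csv.reader(sample)
header = next(reader)  # skip the header row
print(header)

joined_rows = []
for index, row in enumerate(reader, start=1):
    fields = [field for field in row if field]  # drop empty fields such as a trailing ''
    joined = ",".join(fields)
    joined_rows.append(joined)
    print(index, [joined])
```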

Merge two Pandas Dataframes

I have the following merging problem:
I have a time series of industry related data: weekly Profit Margins for 60 different industries over multiple years, which looks like this:
industry = pd.DataFrame({'Ind0': ['01', '02', '03', '04'],
'Ind1': ['11', '12', '13', '14'],
'Ind2': ['21', '22', '23', '24'],
'Ind3': ['31', '32', '33', '34']})
My 2nd dataframe consists of a few 1,000 stocks and their respective industries (each stock belongs to exactly one industry)
stocks = pd.DataFrame({'Stock0': ['Ind0'],
'Stock1': ['Ind1'],
'Stock2': ['Ind2'],
'Stock3': ['Ind3'],
'Stock4': ['Ind0'],
'Stock5': ['Ind1']})
I would like to create a new dataframe that contains the industry time series for each stock coming from the correct industry that the stock belongs to, i.e. something like this:
result = pd.DataFrame({'Stock0': ['01', '02', '03', '04'],
'Stock1': ['11', '12', '13', '14'],
'Stock2': ['21', '22', '23', '24'],
'Stock3': ['31', '32', '33', '34'],
'Stock4': ['01', '02', '03', '04'],
'Stock5': ['11', '12', '13', '14']})
I have tried a number of merge/concatenate approaches without success. Any help is appreciated.
Is this what you want?
stocks.T.merge(industry.T,left_on=0,right_index=True).drop(['key_0','0_x'],axis=1).rename(columns={'0_y':0}).T
Out[189]:
Stock0 Stock4 Stock1 Stock5 Stock2 Stock3
0 01 01 11 11 21 31
1 02 02 12 12 22 32
2 03 03 13 13 23 33
3 04 04 14 14 24 34
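An alternative that avoids the transpose and merge, shown here only as a sketch: since stocks has a single row mapping each stock to its industry, the result can be built column by column from that mapping. The variable name mapping is illustrative, and the stock columns keep their original order.

```python
import pandas as pd

industry = pd.DataFrame({'Ind0': ['01', '02', '03', '04'],
                         'Ind1': ['11', '12', '13', '14'],
                         'Ind2': ['21', '22', '23', '24'],
                         'Ind3': ['31', '32', '33', '34']})
stocks = pd.DataFrame({'Stock0': ['Ind0'],
                       'Stock1': ['Ind1'],
                       'Stock2': ['Ind2'],
                       'Stock3': ['Ind3'],
                       'Stock4': ['Ind0'],
                       'Stock5': ['Ind1']})

# The single row of `stocks` maps each stock name to its industry column
mapping = stocks.iloc[0]

# Pick the matching industry series for every stock
result = pd.DataFrame({stock: industry[ind] for stock, ind in mapping.items()})
print(result)
```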
