Use regex to find and merge words in a string in Python

I'm trying to find a way to match and merge the team names in strings like the ones below. I've tried a few different approaches with regex but was unsuccessful. A few examples:
'30 Detroit Red Wings 12 47:06 3 8 1 3 7 0.292'
'31 Los Angeles Kings 11 47:45 4 7 0 4 8'
24 Anaheim Ducks 12 47:49 7 5 0 7 14 0.583
I want the output to look like this:
[30, 'Detroit Red Wings', 12, 47:06, 3, 8, 1, 3, 7, 0.292]
[24, 'Anaheim Ducks', 12, 47:49, 7, 5, 0, 7, 14, 0.583]
Here is what I tried with regex but with no success:
pattern = re.compile(r'\b\w+\b')
matches = pattern.finditer(i)

Here is an option using re.findall:
import re
inp = '30 Detroit Red Wings 12 47:06 3 8 1 3 7 0.292'
matches = re.findall(r'\d+:\d+|\d+(?:\.\d+)?|[A-Za-z]+(?: [A-Za-z]+)*', inp)
print(matches)
This prints:
['30', 'Detroit Red Wings', '12', '47:06', '3', '8', '1', '3', '7', '0.292']
The regex pattern used matches either a time string, an integer/floating point number, or a series of letter-only words:
\d+:\d+ match a time string (e.g. '47:06')
| or
\d+(?:\.\d+)? match an integer/floating point number
| or
[A-Za-z]+(?: [A-Za-z]+)* match a series of words (e.g. Detroit Red Wings)
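re.findall returns every token as a string, while the desired output in the question shows the numbers unquoted. A minimal sketch of one way to convert them, reusing the pattern above; the helper name parse_row is hypothetical, and the time string is left as text because '47:06' has no direct numeric type:

```python
import re

# Same alternation as above: time string, integer/float, or a run of words.
TOKEN = re.compile(r'\d+:\d+|\d+(?:\.\d+)?|[A-Za-z]+(?: [A-Za-z]+)*')

def parse_row(line):
    """Split one stats line into typed fields (hypothetical helper)."""
    row = []
    for tok in TOKEN.findall(line):
        if re.fullmatch(r'\d+', tok):
            row.append(int(tok))        # plain integer
        elif re.fullmatch(r'\d+\.\d+', tok):
            row.append(float(tok))      # decimal number
        else:
            row.append(tok)             # team name or time string stays a str
    return row

print(parse_row('30 Detroit Red Wings 12 47:06 3 8 1 3 7 0.292'))
# [30, 'Detroit Red Wings', 12, '47:06', 3, 8, 1, 3, 7, 0.292]
```

Team names containing digits, hyphens or apostrophes would not match the letters-only branch, so the pattern may need widening for other data.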

Related

I need to place a column of values into categorical bins

I am cleaning up a dataframe about apples. I am supposed to put the values from the "Age" column into categorical bins. However, when I get to the part of actually placing the values into bins and labeling them, either all of my values end up in the first category (code 1), or it seems to drop every value that doesn't fit the first bin (code 2).
Try 1
import numpy as np
import pandas as pd
data = {'Fav': ['Gala', 'Fuji', 'GALA', 'GRANNY SMITH', 'Red Delicious',
'All of them!', 'Pink lady', 'IDK', 'granny smith', 'Honey Crisp',
'Fuji', 'Golden delish', 'McIntosh', 'Empire', 'Gala' ],
'Age': [10,'Old enough', '30+', 'No', 21, 19, 43,37,29,7,28,70,60,52,49]}
apples = pd.DataFrame(data)
# create True/False index
B = apples['AGE'].str.isnumeric()
# for the index, fill missing values with False
B = B.fillna(False)
# select Age column for only those False values from index and code as missing
apples.loc[~B, 'AGE'] = np.nan
#change strings to floats (when I test value counts up to this point everything is there and correct)
apples.loc[~B,'AGE'] = apples.loc[~B,'Q3: AGE'].str.replace('\%', '', regex=True).astype(float)
#binning (at this point it puts ALL values in the "unknown" bin.)
bins = [0,1, 18, 26, 36, 46, 56]
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
apples['AGE'] = pd.cut(B, bins, labels)
#check my result
apples['AGE'].value_counts()
Try 2
# create True/False index
B = apples['AGE'].str.isnumeric()
# for the index, fill missing values with False
B = B.fillna(False)
# select Age column for only those False values from index and code as missing
apples.loc[~B, 'AGE'] = np.nan
#change strings to floats (when I test value counts up to this point everything is there and correct)
apples.loc[~B,'AGE'] = apples.loc[~B,'Q3: AGE'].str.replace('\%', '', regex=True).astype(float)
#binning (at this point I lose all values except the 'unknown' bin.)
apples['AGE']= pd.cut(~B, bins=[0,1,18, 26, 36, 46, 56, 100] ,
labels=['unknown','17 and under', '18-25', '26-35', '36-45', '46-55', '56+'])
#check my result
apples['AGE'].value_counts()
Any other way that I attempt to format this code gives me the type error "'<' not supported between instances of 'int' and 'str'"
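Convert the column with pd.to_numeric(..., errors='coerce') so that the non-numeric entries become NaN, then pass the numeric column itself (not the boolean mask B) to pd.cut. Although NaN is not listed among the "categories" of a categorical, missing values are still tracked separately, as seen by their distinct code -1: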
apples.Age = pd.to_numeric(apples.Age, errors='coerce')
bins = [1, 18, 26, 36, 46, 56, 100]
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
apples.Age = pd.cut(apples.Age, bins=bins, labels=labels)
print(apples)
print(apples.Age.cat.codes)
# Output:
              Fav           Age
0            Gala  17 and under
1            Fuji           NaN
2            GALA           NaN
3    GRANNY SMITH           NaN
4   Red Delicious         18-25
5    All of them!         18-25
6       Pink lady         36-45
7             IDK         36-45
8    granny smith         26-35
9     Honey Crisp  17 and under
10           Fuji         26-35
11  Golden delish           56+
12       McIntosh           56+
13         Empire         46-55
14           Gala         46-55
0      0
1     -1
2     -1
3     -1
4      1
5      1
6      3
7      3
8      2
9      0
10     2
11     5
12     5
13     4
14     4
dtype: int8
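One caveat about the bin edges: pd.cut uses right-closed intervals by default, so with the bins above an age of exactly 18 lands in '17 and under'. A minimal sketch, assuming the labels are meant to start at their lower bound, passes right=False and adds an explicit 'unknown' category (the name comes from the question's second attempt) for the rows that could not be parsed. The sample ages and the upper edge of 120 are made up for illustration:

```python
import pandas as pd

# Illustrative ages only; the strings mimic the free-text answers in the question.
ages = pd.Series([10, 'Old enough', '30+', 'No', 21, 18, 43])

# Non-numeric entries become NaN instead of raising.
numeric = pd.to_numeric(ages, errors='coerce')

bins = [0, 18, 26, 36, 46, 56, 120]   # 120 is an assumed upper limit
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']

# right=False makes each bin [low, high), so 18 falls in '18-25'.
binned = pd.cut(numeric, bins=bins, labels=labels, right=False)

# NaN has no category of its own; add one before filling.
binned = binned.cat.add_categories(['unknown']).fillna('unknown')

print(binned.value_counts())
```

Whether unparseable answers such as '30+' should be salvaged rather than marked unknown depends on the survey, so that step is left out here.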

Separate values in a DataFrame column into a new columns depending on value

I have a DataFrame like below but much larger:
df = pd.DataFrame({'team': ['Mavs', 'Lakers', 'Spurs', 'Cavs', 'Mavs', 'Lakers', 'Spurs', 'Cavs'],
'name': ['Dirk', 'Kobe', 'Tim', 'Lebron', 'Kobe', 'Lebron', 'Tim', 'Lebron'],
'rebounds': [11, 7, 14, 7, 9, 5,7,12],
'points': [26, 31, 22, 29, 23, 56, 84, 23]})
I need to extract the different team names in the team column into four separate columns (Mavs, Lakers, Spurs, Cavs), retaining the other data in the row. I am sure this is really easy to do, but despite looking into it I am at a loss, as most of what I have found online involves splitting a string by a delimiter rather than by value.
The result should be a DataFrame with seven columns (Mavs, Lakers, Spurs, Cavs, name, rebounds, points). The team columns that have been split out can just contain the name of the team, so something like:
Mavs  Lakers  Spurs  Cavs  name    rebounds  points
Mavs                       Dirk    11        26
Mavs                       Kobe    9         23
      Lakers               Kobe    7         31
      Lakers               Lebron  5         56
and so on
Many thanks in advance. I would post an image but stack overflow doesn't seem to be letting me.
You can try pivot
df_ = df.pivot(index=['name', 'rebounds', 'points'], columns='team', values='team').reset_index().fillna('')
print(df_)
team    name  rebounds  points  Cavs  Lakers  Mavs  Spurs
0       Dirk        11      26                Mavs
1       Kobe         7      31        Lakers
2       Kobe         9      23                Mavs
3     Lebron         5      56        Lakers
4     Lebron         7      29  Cavs
5     Lebron        12      23  Cavs
6        Tim         7      84                      Spurs
7        Tim        14      22                      Spurs
You can use get_dummies:
df2 = pd.get_dummies(df, columns=['team'], prefix='team')
It will fill in the columns with 1s and 0s (True/False in pandas 2.0 and later).
If you really want the output that you describe (with the name of the team, or empty), you can then iterate over those columns and replace the indicator values with what you want:
for team in df['team'].unique():
    col = 'team_' + team
    # astype(bool) handles both the 1/0 and the True/False encodings
    df2[col] = df2[col].astype(bool).map({True: team, False: ''})

Is it possible to check a string comparing two regex then adding it to a dictionary?

Question
How can I run through the string so that when the locationRegex condition is met its match is added to a dictionary, any subsequent numbers found by numberRegex are added to the same entry, and a new entry is created when the next location arrives? See the desired output below.
Code
import re
# Text to check
text = "Italy Roma 20 40 10 4902520 10290" \
"Italy Milan 20 10 49 20 1030" \
"Germany Berlin 20 10 10 10 29 490" \
"Germany Frankfurt 20 0 0 0 0" \
"Luxemburg Luxemburg 20 10 49"
# regex to find location
locationRegex = re.compile(r'[A-Z]\w+\s[A-Z]\w+')
# regex to find numbers
numberRegex = re.compile(r'[0-9]+')
# Desired output
locations = {'Italy Roma': {'numbers': [10, 40, 10, 4902520]},
'Italy Milan': {'numbers': [20, 10, 49, 20, 1030]}}
What I have tried
I have run the regexes against the string with re.findall, but I cannot assign the numbers to the locations, because the results sit in two separate pots: one of locations and one of numbers.
Use a single regex to split the text into chunks, use groups within the regex to separate the data (note the parentheses), and finally use split to split the number string on the spaces:
import re
text = (
"Italy Roma 20 40 10 4902520 10290"
"Italy Milan 20 10 49 20 1030"
"Germany Berlin 20 10 10 10 29 490"
"Germany Frankfurt 20 0 0 0 0"
"Luxemburg Luxemburg 20 10 49"
)
line_regex = re.compile(r"([A-Z]\w+\s[A-Z]\w+) ([0-9 ]+)")
loc_dict = {}
for match in re.finditer(line_regex, text):
    print(match.group(1))
    print(match.group(2))
    loc_dict[match.group(1)] = {"numbers": match.group(2).split(" ")}
print(loc_dict)
The dict will be:
{'Italy Roma': {'numbers': ['20', '40', '10', '4902520', '10290']},
'Italy Milan': {'numbers': ['20', '10', '49', '20', '1030']},
'Germany Berlin': {'numbers': ['20', '10', '10', '10', '29', '490']},
'Germany Frankfurt': {'numbers': ['20', '0', '0', '0', '0']},
'Luxemburg Luxemburg': {'numbers': ['20', '10', '49']}}
Note that you should check for edge cases: no numbers, cities with a space in the name and so on.
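The desired output in the question holds integers rather than strings, and split(" ") would yield empty strings if a number group ever ended in a space. A minimal sketch of a variant that handles both, assuming the same location pattern and a shortened copy of the sample text; it uses split() with no argument plus int():

```python
import re

# Shortened copy of the sample text (adjacent literals are concatenated).
text = (
    "Italy Roma 20 40 10 4902520 10290"
    "Italy Milan 20 10 49 20 1030"
)

# Group 1: two capitalised words; group 2: the run of digits and spaces.
line_regex = re.compile(r"([A-Z]\w+\s[A-Z]\w+)\s+([0-9 ]+)")

locations = {
    m.group(1): {"numbers": [int(n) for n in m.group(2).split()]}
    for m in line_regex.finditer(text)
}

print(locations)
# {'Italy Roma': {'numbers': [20, 40, 10, 4902520, 10290]},
#  'Italy Milan': {'numbers': [20, 10, 49, 20, 1030]}}
```

A repeated location would overwrite the earlier entry in this sketch; collecting into a list per key would be needed if that can happen.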
Cheers!

Splitting 3 space-separated values in a string from the end of the string?

I have this string:
"peter bull team tunnel rat 10 20 30"
What I would like to do is extract the last 3 values, starting from the end:
30
20
10
What is the smartest way to pull out these last 3 fields, in reverse order, in Python?
One simple way would be with rsplit:
s = "peter bull team tunnel rat 10 20 30"
n = 3
out = s.rsplit(maxsplit=n)[-n:]
# ['10', '20', '30']
If you want a list of integers:
list(map(int, out))
# [10, 20, 30]
Following the comment, if you want to append the text before the last digits, one way would be:
s, *d = s.rsplit(sep=' ',maxsplit=3)
' '.join([*d, s])
# '10 20 30 peter bull team tunnel rat'
You can use split and the reversed() function to get the values backwards:
data = "peter bull team tunnel rat 10 20 30"
print (list(reversed(data.split()[-3:])))
output:
['30', '20', '10']
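Using split() with a list comprehension, followed by list.reverse(), which reverses the elements of the list in place. Note that this collects every all-digit token in the string, not only the last three.
Example: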
sentence = "peter bull team tunnel rat 10 20 30"
num = [int(s) for s in sentence.split() if s.isdigit()]
num.reverse()
print(num)
Output:
[30, 20, 10]
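The approaches above differ when the text part itself contains digits. A minimal sketch combining rsplit with reversed(), so that only the last n fields are taken and returned as integers, last field first; the function name last_values and the second sample string are made up for illustration:

```python
def last_values(s, n=3):
    """Return the last n whitespace-separated fields of s as ints, last field first."""
    tail = s.rsplit(maxsplit=n)[-n:]   # split from the right, keep the n trailing fields
    return [int(x) for x in reversed(tail)]

print(last_values("peter bull team tunnel rat 10 20 30"))
# [30, 20, 10]

# A digit token earlier in the string is ignored here,
# whereas an isdigit() filter would pick it up as well.
print(last_values("room 101 team 10 20 30"))
# [30, 20, 10]
```

int() raises ValueError if one of the trailing fields is not a number, which may or may not be the behaviour you want.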

Python Sorting Contents of txt file

I have a function that opens a file called: "table1.txt" and outputs the comma separated values into a certain format.
My function is:
def sort_and_format():
    contents = []
    with open('table1.txt', 'r+') as f:
        for line in f:
            contents.append(line.split(','))
    max_name_length = max([len(line[0]) for line in contents])
    print(" Team Points Diff Goals \n")
    print("--------------------------------------------------------------------------\n")
    for i, line in enumerate(contents):
        line = [el.replace('\n', '') for el in line]
        print("{i:3} {0:{fill_width}} {1:3} {x:3} {2:3} :{3:3}".format(i=i+1, *line,
              x = (int(line[2])- int(line[3])), fill_width=max_name_length))
I figured out how to format it correctly, so for a "table1.txt" file of:
FC Ingolstadt 04, 13, 4, 6
Hamburg, 9, 8, 10
SV Darmstadt 98, 9, 8, 9
Mainz, 9, 6, 9
FC Augsburg, 4, 7, 12
Werder Bremen, 6, 7, 12
Borussia Moenchengladbach, 6, 9, 15
Hoffenheim, 5, 8, 12
VfB Stuttgart, 4, 9, 17
Schalke 04, 16, 14, 3
Hannover 96, 2, 6, 18
Borrusia Dortmund, 16, 15, 4
Bayern Munich, 18, 18, 2
Bayer Leverkusen, 14, 11, 8
Eintracht Frankfurt, 9, 13, 9
Hertha BSC Berlin, 14, 5, 4
1. FC Cologne, 13, 10, 10
VfB Wolfsburg, 14, 10, 6
It would output:
     Team                       Points  Diff  Goals
--------------------------------------------------------------------------
  1  FC Ingolstadt 04               13    -2    4 : 6
  2  Hamburg                         9    -2    8 : 10
  3  SV Darmstadt 98                 9    -1    8 : 9
  4  Mainz                           9    -3    6 : 9
  5  FC Augsburg                     4    -5    7 : 12
  6  Werder Bremen                   6    -5    7 : 12
  7  Borussia Moenchengladbach       6    -6    9 : 15
  8  Hoffenheim                      5    -4    8 : 12
  9  VfB Stuttgart                   4    -8    9 : 17
 10  Schalke 04                     16    11   14 : 3
 11  Hannover 96                     2   -12    6 : 18
 12  Borrusia Dortmund              16    11   15 : 4
 13  Bayern Munich                  18    16   18 : 2
 14  Bayer Leverkusen               14     3   11 : 8
 15  Eintracht Frankfurt             9     4   13 : 9
 16  Hertha BSC Berlin              14     1    5 : 4
 17  1. FC Cologne                  13     0   10 : 10
 18  VfB Wolfsburg                  14     4   10 : 6
I am trying to figure out how to sort the file so that the team with the highest points would be ranked number 1, and if a team has equal points then they are ranked by diff(the difference in goals for and against the team), and if the diff is the same they are ranked by goals scored.
I thought of implementing a bubble sort function similar to:
def bubble_sort(lst):
    j = len(lst)
    made_swap = True
    swaps = 0
    while made_swap:
        made_swap = False
        for cnt in range (j-1):
            if lst[cnt] < lst[cnt+1]:
                lst[cnt], lst[cnt+1] = lst[cnt+1], lst[cnt]
                made_swap = True
                swaps = swaps + 1
    return swaps
But I do not know how to isolate each line and compare the values of each to one another to sort.
The following code will sort the list in the ways you asked:
from operator import itemgetter
def sort_and_format():
    contents = []
    with open('table1.txt', 'r+') as f:
        for line in f:
            l = line.split(',')
            l[1:] = map(int, l[1:])
            contents.append(l)
    # least significant key first; reverse=True puts the highest value on top
    contents.sort(key=itemgetter(2), reverse=True)
    contents.sort(key=lambda team: team[2] - team[3], reverse=True)
    contents.sort(key=itemgetter(1), reverse=True)
    # [printing and formatting code]
What this does differently:
First of all, it converts all the data about each team to numbers, excluding the name. This allows the later code to do math on them.
Then the first contents.sort statement sorts the list by goals scored (index 2). operator.itemgetter(2) is just a faster way to say lambda l: l[2]. The next contents.sort statement stably sorts the list by goals for minus goals against, as that is what the lambda does. Stable sorting means that the order of elements that compare equal does not change, so teams with equal goal difference remain sorted by goals scored. The third contents.sort statement does the same stable sort by points. Each pass uses reverse=True so that the highest value comes first, as the question asks; reverse=True still keeps the sort stable.
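To see that the pass-by-pass stable sort gives the same ranking as a single sort on a tuple key, here is a small self-contained sketch. It uses five rows copied from the question's table as in-memory lists instead of reading table1.txt, and sorts in descending order on the assumption that the highest points should rank first:

```python
from operator import itemgetter

# [name, points, goals for, goals against], taken from the question's table
rows = [
    ['Schalke 04', 16, 14, 3],
    ['Hamburg', 9, 8, 10],
    ['Borrusia Dortmund', 16, 15, 4],
    ['Mainz', 9, 6, 9],
    ['Eintracht Frankfurt', 9, 13, 9],
]

# Multi-pass: sort by the least significant key first, the most significant last.
multi = list(rows)
multi.sort(key=itemgetter(2), reverse=True)             # goals scored
multi.sort(key=lambda t: t[2] - t[3], reverse=True)     # goal difference
multi.sort(key=itemgetter(1), reverse=True)             # points

# Single pass: tuples compare element by element, in the same priority order.
single = sorted(rows, key=lambda t: (t[1], t[2] - t[3], t[2]), reverse=True)

print([t[0] for t in multi])
# ['Borrusia Dortmund', 'Schalke 04', 'Eintracht Frankfurt', 'Hamburg', 'Mainz']
print(multi == single)
# True
```

The tuple key is usually easier to read; the multi-pass form is handy when the individual keys need different sort directions.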
Alternatively, read the file into a list of rows:
contents = [row.strip('\n').split(', ') for row in open('table1.txt', 'r+')]
so that your rows look like:
['FC Ingolstadt 04', '13', '4', '6']
Then you can use Python's built-in sorted function with a tuple key (points, goal difference, goals scored):
table = sorted(contents, key=lambda r: (int(r[1]), int(r[2])-int(r[3]), int(r[2])), reverse=True)
and print 'table' with the specific formatting you want.
I have joined spaces in the first column with _ to make life easier, so the data looks like:
F_ngolstad_4 13 -2 4:6
Hamburg 9 -2 8:10
S_armstad_8 9 -1 8:9
Mainz 9 -3 6:9
F_ugsburg 4 -5 7:12
Werde_remen 6 -5 7:12
Borussi_oenchengladbach 6 -6 9:15
Hoffenheim 5 -4 8:12
Vf_tuttgart 4 -8 9:17
Schalk_4 16 11 14:3
Hannove_6 2 -12 6:18
Borrusi_ortmund 16 11 15:4
Bayer_munich 18 16 18:2
Baye_everkusen 14 3 11:8
Eintrach_rankfurt 9 4 13:9
Herth_S_erlin 14 1 5:4
1._F_ologne 13 0 10:10
Vf_olfsburg 14 4 10:6
all_lines = []
with open('data', 'r') as f:
    for line in f:
        li = line.split()
        all_lines.append(li)
l = sorted(all_lines,key=lambda x: (int(x[1]),int(x[2])),reverse=True)
for el in l:
    print(el)
Output:
['Bayer_munich', '18', '16', '18:2']
['Schalk_4', '16', '11', '14:3']
['Borrusi_ortmund', '16', '11', '15:4']
['Vf_olfsburg', '14', '4', '10:6']
['Baye_everkusen', '14', '3', '11:8']
['Herth_S_erlin', '14', '1', '5:4']
['1._F_ologne', '13', '0', '10:10']
['F_ngolstad_4', '13', '-2', '4:6']
['Eintrach_rankfurt', '9', '4', '13:9']
['S_armstad_8', '9', '-1', '8:9']
['Hamburg', '9', '-2', '8:10']
['Mainz', '9', '-3', '6:9']
['Werde_remen', '6', '-5', '7:12']
['Borussi_oenchengladbach', '6', '-6', '9:15']
['Hoffenheim', '5', '-4', '8:12']
['F_ugsburg', '4', '-5', '7:12']
['Vf_tuttgart', '4', '-8', '9:17']
['Hannove_6', '2', '-12', '6:18']
