I have a dataframe df with ~450000 rows and 4 columns like "HK" as in the example:
df = pd.DataFrame(
{
"HK": [
"19000000-ac-;ghj-;qrs",
"19000000- abcd-",
"19000000 -abc;klm-",
"19000000 - abc-;",
"19000000 a-",
]
}
)
df.head()
| HK
| -------------
| 19000000-ac-;ghj-;qrs
| 19000000- abcd-
| 19000000 -abc-;klm-
| 19000000 - abc-;
| 19000000 a-
I always have 8 digits followed by a value. The digits and the value are separated by different forms of "-" (no whitespace in between digits and value, whitespace left, whitespace right, whitespace left and right, or only a whitespace without a "-").
I would like to get a unified presentation with "$digits$ - $value$" so that my column looks like this:
| HK
| -------------
| 19000000 - ac-;ghj-;qrs
| 19000000 - abcd-
| 19000000 - abc-;klm-
| 19000000 - abc-;
| 19000000 - a-
Using pd.Series.str.replace with a regular expression:
>>> df['HK'].str.replace(r'(?<=\d{8})[\s-]+(?=\w)', ' - ', regex=True)
0 19000000 - ac-;ghj-;qrs
1 19000000 - abcd-
2 19000000 - abc;klm-
3 19000000 - abc-;
4 19000000 - a-
Name: HK, dtype: object
Explaining the regular expression: there is a lookbehind (?<=\d{8}) requiring that there are eight digits immediately before the main section. The main section is [\s-]+, which requires one or more characters that are whitespace or hyphens. Then there is a lookahead (?=\w) requiring that immediately after this is a word character (in this case, something like a).
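As a quick check of that pattern outside pandas, here is a minimal sketch using the standard re module on the five sample strings from the question. The names PATTERN and normalize are made up for illustration; only the regular expression itself comes from the answer.

```python
import re

# Same pattern as in the answer: a lookbehind for eight digits, a run of
# whitespace and/or hyphens, then a lookahead for a word character.
PATTERN = re.compile(r"(?<=\d{8})[\s-]+(?=\w)")


def normalize(value):
    """Hypothetical helper: rewrite the separator after the digits as ' - '."""
    return PATTERN.sub(" - ", value)


samples = [
    "19000000-ac-;ghj-;qrs",
    "19000000- abcd-",
    "19000000 -abc;klm-",
    "19000000 - abc-;",
    "19000000 a-",
]
normalized = [normalize(s) for s in samples]
for before, after in zip(samples, normalized):
    print(f"{before!r} -> {after!r}")
```

Because the lookbehind needs exactly eight digits directly before the separator, hyphens later in the value (such as the one in "ac-;ghj") are left alone.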
I have two dataframes, and I want to mark the second one if the first one contains a pattern.
A: very large number of rows (tens of thousands).
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B: shorter with a few hundred rows.
id | color
123 | is Red
234 | not orange
235 | is green
The result would be to flag every row in B whose pattern is found in A, possibly by adding a column to B like
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
I was thinking of something like dfB['found'] = dfB['color'].isin(dfA['items']), but I don't see any way to ignore case. Also, with this approach a later pass will change true to false, and I don't want to change rows that are already set to true. I also believe it's inefficient to loop over large dataframes more than once. Running through A once and marking B would be a better way, but I'm not sure how to achieve that using isin(). Are there any other ways, especially ones that ignore the case of the pattern?
You can use something like this:
df2['check'] = df2['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in df['items']))
or you can use str.contains:
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(" ").str[1] + ' ' + df['items'].str.split(" ").str[2]),case=False)
#get second and third words
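Here is a self-contained sketch of the case-insensitive substring check on the sample data from the question. The frame names df_a and df_b and the helper appears_in_items are assumptions for illustration; the idea is only to casefold the large column once and then test each short pattern against it as a literal substring.

```python
import pandas as pd

# Sample data copied from the question (A is the large frame, B the short one).
df_a = pd.DataFrame(
    {"items": ["apple is red", "orange is orange", "apple is green"]}
)
df_b = pd.DataFrame(
    {"id": [123, 234, 235], "color": ["is Red", "not orange", "is green"]}
)

# Casefold the large column once instead of once per pattern.
items_folded = df_a["items"].str.casefold()


def appears_in_items(pattern):
    """Hypothetical helper: True if the pattern occurs in any item, ignoring case."""
    return bool(items_folded.str.contains(pattern.casefold(), regex=False).any())


df_b["found"] = df_b["color"].map(appears_in_items)
print(df_b)
```

Using regex=False treats each pattern as plain text, so characters such as "." or "(" in a pattern are not interpreted as regex syntax.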
I've got a list of games between teams that takes place over a sixteen day period:
| Date | Game |
|------|-----------------------------|
| 1 | hot ice vs playerz |
| 1 | caps vs quiet storm |
| 1 | slick ice vs blizzard |
| 1 | flow vs 4x4's |
| 2 | avalanche vs cold force |
| 2 | freeze vs in too deep |
| 2 | game spot vs rare air |
| 2 | out of order vs cold as ice |
| 3 | playerz vs avalanche |
| 3 | quiet storm vs freeze |
| 3 | blizzard vs game spot |
| 3 | 4x4's vs out of order |
| 14 | freeze vs avalanche |
| 14 | out of order vs game spot |
| 14 | in too deep vs cold force |
| 14 | cold as ice vs rare air |
| 15 | blizzard vs quiet storm |
| 15 | playerz vs 4x4's |
| 15 | slick ice vs caps |
| 15 | hot ice vs flow |
| 16 | game spot vs freeze |
| 16 | avalanche vs out of order |
| 16 | rare air vs in too deep |
| 16 | cold force vs cold as ice |
There are 16 teams that make up this schedule, and what I'd like to do in Python is find all of the 8 game combinations that allow me to "see" each team once. The only limitation is that I can only see one game per day. At this point all I can think of is a ton of nested for loops that generates all possible schedules, and then checking each one after to see if it is valid. A valid schedule is one that has one game per date and sees each team once.
You could use a backtracking algorithm to iterate through different combinations of matches and filter them according to the constraints you mentioned.
First step would be to format your data into a collection like a python list or dict. Then implement a recursive backtracking algorithm that selects one match per day, and checks to make sure the chosen match doesn't include teams you have already selected.
Here is a rough example that uses the data you provided in your question:
def combinations(matches, day, schedules, current):
    """Backtracking function for selecting unique schedules."""
    # base case when you have a match from each day
    if day > max(matches.keys()):
        schedules.append(current[:])
        return
    # skip over days where there are no matches
    while day not in matches:
        day += 1
    # select one match for the current date
    for i in range(len(matches[day])):
        teams = matches[day][i]
        current_teams = [team for match in current for team in match]
        # check if the teams are already in the current schedule
        if teams[0] in current_teams or teams[1] in current_teams:
            continue
        del matches[day][i]
        # recursive case
        combinations(matches, day + 1, schedules, current + [teams])
        matches[day].insert(i, teams)
    return
def format(inp):
    """Formats input data into a dictionary."""
    lines = inp.split("\n")[2:]  # split lines of input data
    matches = [(line.split("|")[1:-1]) for line in lines]
    schedule = {}
    # add matches to dict with date as key and matches as value.
    for day, match in matches:
        day = int(day.strip())
        teams = match.strip().split(" vs ")
        try:
            schedule[day].append(teams)
        except KeyError:
            schedule[day] = [teams]
    ideal = []
    # use backtracking algorithm to get desired results
    combinations(schedule, 1, ideal, [])
    show_schedules(ideal)
def show_schedules(results):
    for i, x in enumerate(results):
        print(f"Schedule {i+1}")
        for day, match in enumerate(x):
            print(f"Day: {day+1} - {match[0]} vs. {match[1]}")
        print("\n")
format(inp)  # <- entry point: `inp` is the pre-formatted data `str`
It's not exactly the most elegant code... :) With the example data this algorithm generates 32 unique schedules of 6 games. The output looks something like this but for each day of matches:
Schedule 1
Day: 1 - hot ice vs. playerz
Day: 2 - avalanche vs. cold force
Day: 3 - quiet storm vs. freeze
Day: 4 - out of order vs. game spot
Day: 5 - slick ice vs. caps
Day: 6 - rare air vs. in too deep
Schedule 2
Day: 1 - hot ice vs. playerz
Day: 2 - avalanche vs. cold force
Day: 3 - 4x4's vs. out of order
Day: 4 - cold as ice vs. rare air
Day: 5 - blizzard vs. quiet storm
Day: 6 - game spot vs. freeze
For more information on backtracking, here are a few external resources; there are also countless examples here on Stack Overflow.
https://www.hackerearth.com/practice/basic-programming/recursion/recursion-and-backtracking/tutorial/
http://jeffe.cs.illinois.edu/teaching/algorithms/book/02-backtracking.pdf
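As a compact variation on the same backtracking idea, the sketch below uses a generator and never modifies the input dictionary. The function name valid_schedules and the tiny demo data are made up for illustration; it assumes the matches are already grouped by day as (team, team) pairs.

```python
def valid_schedules(matches_by_day):
    """Yield every schedule with one match per day and no repeated team."""
    days = sorted(matches_by_day)

    def extend(pos, chosen, seen):
        # All days are covered: emit a copy of the current selection.
        if pos == len(days):
            yield list(chosen)
            return
        for home, away in matches_by_day[days[pos]]:
            # Prune as soon as a team has already been seen.
            if home in seen or away in seen:
                continue
            chosen.append((home, away))
            yield from extend(pos + 1, chosen, seen | {home, away})
            chosen.pop()  # backtrack

    yield from extend(0, [], frozenset())


# Tiny illustrative data set (not the schedule from the question).
demo = {
    1: [("a", "b"), ("c", "d")],
    2: [("a", "c"), ("e", "f")],
}
schedules = list(valid_schedules(demo))
for schedule in schedules:
    print(schedule)
```

Keeping the already seen teams in a set makes the membership test cheap, and yielding results lazily avoids holding every schedule in memory if only the first few are needed.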
I'm trying to rationalize a quite scrambled phonebook xls with several thousand records. Some fields are merged with others and/or saved in the wrong column, while other fields are split across two or more columns, and so on. I'm trying to find the patterns of the main errors and fix them with regex, placing each value in the right column.
An example:
DataFrame as df:
id
Name
SecondName
Surname
Title
Company
01
Marc
Gigio
ETC ltd
02
Piero (Four
Season
Restaurant
)
03
bubbu(Caterpilar)
04
gaby(ts Inc)
05
Pit(REV inc)
REV Inc
06
Pluto
In record 01: nothing needs to be done, but see how to manage the conditional exception as in point 5.
In record 02: merge Name + SecondName + Surname, then extract the name (Piero) from the new string and place it in the Name column, and extract the content of the round brackets from the same string and place it in the Company column
df['Nameall_tmp'] = df['Name']+' '+df['SecondName']+' '+df['Surname']+' '+df['Company']
df['Name_tmp'] = df['Nameall_tmp'].str.extract(r'(.+)\(.+')
df['Company_tmp'] = df['Nameall_tmp'].str.extract(r'.*\((.+)\)')
In records 03 and 04: almost the same as 02
In record 06:
df['Nameall_tmp'] = df['Name']+' '+df['SecondName']+' '+df['Surname']+' '+df['Company']
df['Name_tmp'] = df['Nameall_tmp'].str.extract(r'(.+)\(.+')
df['Name_tmp']= np.where(df['Name_tmp'] == 'nan' , df['Name'],df['Name_tmp'] )
In this case the np.where statement doesn't work like an if/then/else: I want to check whether df['Name_tmp'] is "nan" and, if so, fill it with the original df['Name'] to eliminate "nan" from the record; otherwise keep df['Name_tmp']. Any suggestions?
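On the np.where part: a missing value in pandas is NaN, not the string 'nan', so a comparison with == 'nan' is never true for it. A minimal sketch, assuming the goal is to fall back to the original Name whenever the extraction finds no bracket; the two-row frame and the slightly tightened regex are illustrative only.

```python
import numpy as np
import pandas as pd

# Two illustrative rows: one with a bracketed company, one without.
df = pd.DataFrame({"Name": ["Piero (Four", "Pluto"]})

# expand=False returns a Series; rows without a "(" become NaN.
df["Name_tmp"] = df["Name"].str.extract(r"(.+?)\s*\(.+", expand=False)

# Test for missing values with isna() instead of comparing to the text 'nan'.
df["Name_tmp"] = np.where(df["Name_tmp"].isna(), df["Name"], df["Name_tmp"])
print(df)
```

The same fallback can also be written as df["Name_tmp"].fillna(df["Name"]), which avoids np.where altogether.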
Rough thinking here:
munge the "Company" column so that, if it contains a legit company name, it gets wrapped in (); if not, keep the original content
concat all columns into one conglomerate column
use one regex with sr.str.extract(rex) to split that single conglomerate column into the desired columns again
Anyway, following the rough thinking, I have at least reduced the problem to fine-tuning a single regex:
import re
import numpy as np
import pandas as pd
df = pd.DataFrame(
    columns=" index Name SecondName Surname Company ".split(),
    data=[
        [0, "Marc", np.nan, "Gigio", "ETC ltd"],
        [1, "Piero", "(four", "season", "restaurant)"],
        [2, "bubbu(caterpilar)", np.nan, np.nan, np.nan],
        [3, np.nan, np.nan, np.nan, "gaby(ts inc)"],
        [4, "Pit(REV inc)", np.nan, np.nan, "REV inc"],
        [5, "pluto", np.nan, np.nan, np.nan],
    ]).set_index("index", drop=True)
df = df.fillna('')
# wrap a bare company name in () so that every company looks the same to the regex
df['Company'] = df['Company'].apply(lambda x: f'({x})' if ('(' not in x and ')' not in x and x != "") else x)
# df['sum'] = df.sum(axis=1)
df['sum'] = df['Name'] + ' ' + df['SecondName'] + ' ' + df['Surname'] + ' ' + df['Company']
df['sum'] = df['sum'].str.replace(r'\s+', ' ', regex=True)  # get rid of extra \s due to above concat
rex = re.compile(  # very fragile and hardcoded
    r"""
    (?P<name0>[a-z]{2,})
    \s?
    (?P<surename0>[a-z]{2,})?
    \s?
    \(?
    (?P<company0>[a-z\s]{3,})?
    \)?
    \s?
    """,
    re.X | re.I
)
df['sum'].str.extract(rex)
output:
+---------+---------+-------------+------------------------+
| index | name0 | surename0 | company0 |
|---------+---------+-------------+------------------------|
| 0 | Marc | Gigio | ETC ltd |
| 1 | Piero | nan | four season restaurant |
| 2 | bubbu | nan | caterpilar |
| 3 | gaby | nan | ts inc |
| 4 | Pit | nan | REV inc |
| 5 | pluto | nan | nan |
+---------+---------+-------------+------------------------+
EDIT:
The earlier answer contained an error in my regex (I forgot to ? the \(), so it couldn't quite handle "pluto"; corrected now.
The moral of the story is that the regex you need to design will be very specialized, fragile and hardcoded. It is almost worth considering a df['sum'].apply(myfoo) approach just to parse df['sum'] more thoroughly.
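To make the apply idea a little more concrete, here is a rough sketch of what such a function could look like. The name parse_record and its splitting rule (everything before the first "(" is the name, the first bracketed part is the company) are assumptions for illustration, not a complete parser for the real phonebook.

```python
import pandas as pd


def parse_record(text):
    """Hypothetical parser: split 'name (company)' into two fields."""
    # Everything before the first "(" is treated as the name part.
    name, _, rest = text.partition("(")
    # The company is whatever sits between that "(" and the next ")".
    company = rest.partition(")")[0]
    return pd.Series({"name": name.strip(), "company": company.strip()})


# Strings shaped like the concatenated 'sum' column from the answer.
records = pd.Series(["Marc Gigio (ETC ltd)", "Pit(REV inc) (REV inc)", "pluto "])
parsed = records.apply(parse_record)
print(parsed)
```

Plain string methods are easier to extend with special cases than one large regex, at the cost of being slower on very large frames.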
I have a pandas data frame that consists of 4 rows, the English rows contain news titles, some rows contain non-English words like this one
**She’s the Hollywood Power Behind Those ...**
I want to remove all rows like this one, i.e. all rows in the Pandas data frame that contain at least one non-English character.
If using Python >= 3.7:
df[df['col'].map(lambda x: x.isascii())]
where col is your target column.
Data:
df = pd.DataFrame({
'colA': ['**She’s the Hollywood Power Behind Those ...**',
'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})
print(df.to_markdown())
| | colA |
|---:|:------------------------------------------------------|
| 0 | **She’s the Hollywood Power Behind Those ...** |
| 1 | Hello, world! |
| 2 | Cainã |
| 3 | another value |
| 4 | test123* |
| 5 | âbc |
Identifying and filtering strings with non-English characters (see the ASCII printable characters):
df[df.colA.map(lambda x: x.isascii())]
Output:
colA
1 Hello, world!
3 another value
4 test123*
Original approach was to use a user-defined function like this:
def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
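You can use a regex to do that.
The re module is part of Python's standard library, so nothing needs to be installed (pip install regex installs a separate third-party package that is not needed here).
import re
and use [^a-zA-Z] to filter it.
To break it down:
^: not (when it is the first character inside the brackets)
a-z: lowercase letters
A-Z: capital letters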
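One caveat: [^a-zA-Z] also matches digits, spaces and punctuation, so it would flag a row like "Hello, world!" as well. A minimal sketch of a regex-based filter, assuming the aim is to drop only rows that contain characters outside the ASCII range; the class [^\x00-\x7F] is my substitution for that purpose and not the pattern from the answer.

```python
import pandas as pd

# Sample values taken from the data shown earlier in this thread.
df = pd.DataFrame(
    {"colA": ["Hello, world!", "Cainã", "another value", "test123*", "âbc"]}
)

# True for every row holding at least one character outside the ASCII range.
has_non_ascii = df["colA"].str.contains(r"[^\x00-\x7F]", regex=True)

# Keep only the rows made up entirely of ASCII characters.
ascii_only = df[~has_non_ascii]
print(ascii_only)
```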
I am attempting to make a Caesar cipher that changes the key for each letter. I currently have a working cipher that shifts the entire string by a single key, running through the keys 1-25; however, I would like it to use a different shift for each letter, so that in the string "ABC" A is shifted by 1, B by 2 and C by 3, resulting in BDF.
I already have a working cipher, and am just not sure how to have it change each letter.
import collections
import string
def caesar(rotate_string, number_to_rotate_by):
    upper = collections.deque(string.ascii_uppercase)
    lower = collections.deque(string.ascii_lowercase)
    upper.rotate(number_to_rotate_by)
    lower.rotate(number_to_rotate_by)
    upper = ''.join(list(upper))
    lower = ''.join(list(lower))
    return rotate_string.translate(str.maketrans(string.ascii_uppercase, upper)).translate(str.maketrans(string.ascii_lowercase, lower))
#print (caesar("This is simple", 2))
our_string = "ABC"
for i in range(len(string.ascii_uppercase)):
    print (i, "|", caesar(our_string, i))
Outcome is this:
0 | ABC
1 | ZAB
2 | YZA
3 | XYZ
4 | WXY
5 | VWX
6 | UVW
7 | TUV
8 | STU
9 | RST
10 | QRS
11 | PQR
12 | OPQ
13 | NOP
14 | MNO
15 | LMN
16 | KLM
17 | JKL
18 | IJK
19 | HIJ
20 | GHI
21 | FGH
22 | EFG
23 | DEF
24 | CDE
25 | BCD
What I would like is to have it a shift of 1 or 0 for the first letter, then 2 for the second, and so on.
Good effort! Note that the mapping you want doesn't just rearrange the letters of the alphabet, so it can never be achieved by rotating the alphabet. In your example, upper would have to become the following mapping:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
BDFHJLNPRTVXZBDFHJLNPRTVXZ
Also note this cipher is not easily reversible, i.e. it's not clear whether to reverse 'B'->'A' or 'B'->'N'.
(Side note: If we treat the letters ZABCDEFGHIJKLMNOPQRSTUVWXY as the numbers 0-25, this cipher multiplies by two (modulo 26): (x*2)%26. If, instead of 2, we multiply by any number divisible by neither 2 nor 13, the resulting cipher will always be reversible. Can you see why? Hints: [1], [2].)
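The side note can be checked numerically. The sketch below counts how many distinct outputs each multiplier produces modulo 26 and compares that with a greatest-common-divisor test; the helper names image_size and is_reversible are made up for illustration.

```python
from math import gcd

ALPHABET_SIZE = 26


def image_size(multiplier):
    """Number of distinct values of (x * multiplier) % 26 for x in 0..25."""
    return len({(x * multiplier) % ALPHABET_SIZE for x in range(ALPHABET_SIZE)})


def is_reversible(multiplier):
    """The mapping is one-to-one exactly when the multiplier shares no factor with 26."""
    return gcd(multiplier, ALPHABET_SIZE) == 1


for m in (2, 3, 13):
    print(m, image_size(m), is_reversible(m))
```

A multiplier that hits all 26 values can be undone; one that hits fewer maps several letters onto the same output, which is the ambiguity described above.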
When you feel confused about a piece of code, often it's a good sign it's time to refactor a part of it into a separate function, e.g. like this:
(Playground: https://ideone.com/wNSADR)
import string
def letter_index(letter):
    """Determines the position of the given letter in the English alphabet
    'a' -> 0
    'A' -> 0
    'z' -> 25
    """
    if letter not in string.ascii_letters:
        raise ValueError("The argument must be an English letter")
    if letter in string.ascii_lowercase:
        return ord(letter) - ord('a')
    return ord(letter) - ord('A')
def caesar(s):
    """Ciphers the string s by shifting 'A'->'B', 'B'->'D', 'C'->'F', etc
    The shift is cyclic, i.e. 'A' comes after 'Z'.
    """
    ret = ""
    for letter in s:
        index = letter_index(letter)
        new_index = 2*index + 1
        if new_index >= len(string.ascii_lowercase):
            # The letter is shifted farther than 'Z'
            new_index %= len(string.ascii_lowercase)
        new_letter = chr(ord(letter) - index + new_index)
        ret += new_letter
    return ret
print('caesar("ABC"):', caesar("ABC"))
print('caesar("abc"):', caesar("abc"))
print('caesar("XYZ"):', caesar("XYZ"))
Output:
caesar("ABC"): BDF
caesar("abc"): bdf
caesar("XYZ"): VXZ
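If the shift is instead meant to depend on the position in the string (1 for the first letter, 2 for the second, and so on, as the question describes), a sketch could look like the following. The function name progressive_caesar and the choice to leave non-letters untouched without advancing the shift are assumptions, not part of the answer's code.

```python
import string


def progressive_caesar(text, start=1):
    """Shift the first letter by `start`, the next by `start + 1`, and so on."""
    result = []
    shift = start
    for ch in text:
        if ch in string.ascii_uppercase:
            base = ord("A")
        elif ch in string.ascii_lowercase:
            base = ord("a")
        else:
            # Assumption: non-letters are copied as-is and do not use up a shift.
            result.append(ch)
            continue
        # Wrap around the alphabet with modulo 26.
        result.append(chr(base + (ord(ch) - base + shift) % 26))
        shift += 1
    return "".join(result)


print(progressive_caesar("ABC"))
print(progressive_caesar("XYZ"))
```

For "ABC" this gives the same result as the mapping above, but the two approaches differ for other inputs because here the shift follows the position rather than the letter.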
Resources:
chr
ord