I have the following STRING column on a pandas DataFrame.
HOURCENTSEG(string-column)
070026.16169
070026.16169
070026.16169
070026.16169
070052.85555
070052.85555
070109.43620
070202.56430
070202.56431
070202.56434
070202.56434
As you can see, many elements have overlapping times before the point. To avoid these date overlaps I must add an incremental counter in all the fields, as I show in the following output example.
HOURCENTSEG (string-column)
070026.00001
070026.00002
070026.00003
070026.00004
070052.00001
070052.00002
070109.00001 (if there is only one value it's just 00001)
070202.00001
070202.00002
070202.00003
070202.00004
It is a poorly designed application from the past and I have no other option to solve this.
Summary: add an incremental counter after the point, with a maximum width of 5 and padded with 0 from the left, restarting whenever the number to the left of the point changes.
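For reproducibility, the sample column above can be rebuilt like this (a sketch; the values are kept as plain strings):
import pandas as pd

df = pd.DataFrame({'HOURCENTSEG': [
    '070026.16169', '070026.16169', '070026.16169', '070026.16169',
    '070052.85555', '070052.85555', '070109.43620',
    '070202.56430', '070202.56431', '070202.56434', '070202.56434',
]})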
Use GroupBy.cumcount on the values split by '.' (keeping the first part), then left-pad with zeros using Series.str.zfill:
s = df['HOURCENTSEG'].str.split('.').str[0]
df['HOURCENTSEG'] = s + '.' + s.groupby(s).cumcount().add(1).astype(str).str.zfill(5)
print (df)
HOURCENTSEG
0 070026.00001
1 070026.00002
2 070026.00003
3 070026.00004
4 070052.00001
5 070052.00002
6 070109.00001
7 070202.00001
8 070202.00002
9 070202.00003
10 070202.00004
Related
I need to locate the first position where the word 'their' appears in the Words table. I'm trying to write code that consolidates all strings in the 'text' column from that location up to the first text containing the substring '666' or '999' (in this case a combination of their, stoma22, fe156, sligh334, pain666; the desired subtrings_output = 'theirstoma22fe156sligh334pain666').
I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
As you can see, I'm not sure how to extend the condition for 666_999_loc to include substring 666 or 999; also, slicing the index between the two variables raises an error. Many thanks.
Words table:
page no  text      font
1        they      0
1        ate       0
1        apples    0
2        and       0
2        then      1
2        their     0
2        stoma22   0
2        fe156     1
2        sligh334  0
2        pain666   1
2        given     0
2        the       1
3        fruit     0
You just need to add one to the end of the slice, and add an OR condition to the np.where for 666_or_999_loc using the | operator.
import numpy as np

text_col = words['text']
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
                                   text_col.str.contains('999', na=True))[0][0]
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])
print(subtrings_output)
Output:
theirstoma22fe156sligh334pain666
IIUC, use pandas.Series.idxmax with "".join().
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that
value is returned.
So, assuming Words is your dataframe, try this:
their_loc = Words["text"].str.contains("their").idxmax()
_666_999_loc = Words["text"].str.contains("666").idxmax()
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])
Output :
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
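If you also need to stop at '999' as described in the question, the same idxmax approach should work with a regex alternation; a sketch reusing the Words frame from above:
# str.contains treats the pattern as a regex, so '666|999' matches either substring
_666_999_loc = Words["text"].str.contains("666|999").idxmax()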
This is an extension to a question I posted earlier: Python Sum lookup dynamic array table with df column
I'm currently investigating a way to efficiently map a decision variable to a dataframe. The main DF and the lookup table will be dynamic in length (15,000+ lines and 20+ lines, respectively), so I was hoping to avoid a loop, but I'm happy to hear suggestions.
The DF (DF1) will mostly look like the following, and it is where I would like to look up the decision.
The decision value is found on a separate DF (DF0).
For example: the first DF1["ValuesWhereXYcomefrom"] value is 6.915, which falls between 3.8 and 7.4 on the key table, and thus the corresponding DF0["Decision"] value is -1. The process then repeats until every line is mapped to a decision.
I was thinking of using the Python bisect library, but I haven't arrived at a working solution with it, and a plain loop does work but is slow. Now I'm wondering if I am looking at the problem incorrectly, as mapping and looping over 15k lines is time consuming.
Example Main Data (DF1):
time  Value0     Value1       Value2       ValuesWhereXYcomefrom  Value_toSum   Decision Map
1     41.43      6.579482077                                      0.00531021
2     41.650002  6.756817908  46.72466411  6.915187703            0.001200456   -1
3     41.700001  6.221966706  11.64727001  1.871959552            0.000959257   -1
4     41.740002  6.230847055  46.92753343  7.531485368            0.006228989   1
5     42         6.637399856  8.031374656  1.210018204            0.010238095   -1
6     42.43      7.484894608  16.24547568  2.170434793            -0.007777563  -1
7     42.099998  7.595291765  38.73871244  5.100358702            0.003562993   -1
8     42.25      7.567457423  37.07538953  4.899319211            0.01088755    -1
9     42.709999  8.234795546  64.27986403  7.805884636            0.005151042   1
10    42.93      8.369526407  24.72700129  2.954408659            -0.003028209  -1
11    42.799999  8.146653099  61.52243361  7.55186613             0             1
Example KeyTable (DF0):
ValueX       ValueY       SUM          Decision
0.203627201  3.803627201  0.040294925  -1
3.803627201  7.403627201  0.031630668  -1
7.403627201  11.0036272   0.011841521  1
Here's how I would go about this, assuming your first DataFrame is called df and your second is decision:
import numpy as np

def map_func(x):
    # walk the key table and return the Decision of the first interval whose ValueY exceeds x
    for i in range(len(decision)):
        try:
            if x < decision["ValueY"].iloc[i]:
                return decision["Decision"].iloc[i]
        except Exception:
            return np.nan
    return np.nan  # x was above every interval (or missing)

df["decision"] = df["ValuesWhereXYcomefrom"].apply(map_func)
This will create a new column in your DataFrame called "decision" that contains the looked-up value. You can then just query it:
df.decision.iloc[row]
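Since the question mentions 15,000+ rows, a vectorized alternative may be worth sketching. The snippet below is only a sketch: it assumes the DF0 intervals are contiguous and sorted (as in the example key table), that the lookup column is numeric with NaN for missing values, and it reuses the DF0/DF1 names from the question; np.searchsorted places every value into its interval in one pass:
import numpy as np

# interval edges: every ValueX plus the final ValueY (assumes contiguous, sorted intervals)
edges = np.append(DF0["ValueX"].to_numpy(), DF0["ValueY"].iloc[-1])
vals = DF1["ValuesWhereXYcomefrom"].to_numpy(dtype=float)
# for each value, find its interval so that edges[pos] <= value < edges[pos + 1]
pos = np.searchsorted(edges, vals, side="right") - 1
in_range = (pos >= 0) & (pos < len(DF0)) & ~np.isnan(vals)
decisions = np.full(len(DF1), np.nan)
decisions[in_range] = DF0["Decision"].to_numpy()[pos[in_range]]
DF1["Decision Map"] = decisions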
Let us say I have the following simple data frame. But in reality, I have hundreds of thousands of rows like this.
df
ID Sales
倀굖곾ꆹ譋῾理 100
倀굖곾ꆹ 50
倀굖곾ꆹ譋῾理 70
곾ꆹ텊躥㫆 60
My idea is that I want to replace the Chinese characters with randomly generated 8-digit numbers, something like below.
ID Sales
13434535 100
67894335 50
13434535 70
10986467 60
The digits are randomly generated, but they should preserve uniqueness as well. For example, rows 0 and 2 are the same, so when they are replaced by a random unique ID, they should get the same ID as well.
Can anyone help on this in Python pandas? Any solution that is already done before is also welcome.
The primary method here will be to use Series.map() on the 'ID's to assign the new values.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
which is exactly what you're looking for.
Here are some options for generating the new IDs:
1. Randomly generated 8-digit integers, as asked
You can first create a map of randomly generated 8-digit integers with each of the unique ID's in the dataframe. Then use Series.map() on the 'ID's to assign the new values back. I've included a while loop to ensure that the generated ID's are unique.
import random
original_ids = df['ID'].unique()
while True:
    new_ids = {id_: random.randint(10_000_000, 99_999_999) for id_ in original_ids}
    if len(set(new_ids.values())) == len(original_ids):
        # all the generated id's were unique
        break
    # otherwise this will repeat until they are
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 91154173 100
1 27127403 50
2 91154173 70
3 55892778 60
Edit & Warning: The original ids are Chinese characters and they are already length 8. There's definitely more than 10 Chinese characters so with the wrong combination of original IDs, it could become impossible to make unique-enough 8-digit IDs for the new set. Unless you are memory bound, I'd recommend using 16-24 digits. Or even better...
2. Use UUIDs. [IDEAL]
You can still use an "integer" version of the ID by taking the UUID as an integer instead of hex. This has the added benefit of not needing to check for uniqueness:
import uuid
original_ids = df['ID'].unique()
new_ids = {cid: uuid.uuid4().int for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
(If you are okay with hex id's, change uuid.uuid4().int above to uuid.uuid4().hex.)
Output:
ID Sales
0 10302456644733067873760508402841674050 100
1 99013251285361656191123600060539725783 50
2 10302456644733067873760508402841674050 70
3 112767087159616563475161054356643068804 60
2.B. Smaller numbers from UUIDs
If the ID generated above is too long, you could truncate it, with some minor risk. Here, I'm only using the first 16 hex characters and converting those to an int. You may put that in the uniqueness loop check as done for option 1, above.
import uuid
original_ids = df['ID'].unique()
DIGITS = 16 # number of hex digits of the UUID to use
new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
Output:
ID Sales
0 14173925717660158959 100
1 10599965012234224109 50
2 14173925717660158959 70
3 13414338319624454663 60
3. Creating a mapping based on the actual value:
This group of options has these advantages:
no uniqueness check is needed, since the new ID is derived deterministically from the original ID (so original IDs which were the same will generate the same new ID)
no map needs to be created in advance
3.A. CRC32
(Higher probability of finding a collision with different IDs, compared to option 2.B. above.)
import zlib
df['ID'] = df['ID'].map(lambda cid: zlib.crc32(bytes(cid, 'utf-8')))
Output:
ID Sales
0 2083453980 100
1 1445801542 50
2 2083453980 70
3 708870156 60
3.B. Python's built-in hash() of the original ID [My preferred approach in this scenario]
Can be done in one line, no imports needed
Reasonably secure to not generate collisions for IDs which are different
(Note: string hashes are randomized per interpreter session unless PYTHONHASHSEED is set, so the mapping is only stable within a single run.)
df['ID'] = df['ID'].map(hash)
Output:
ID Sales
0 4663892623205934004 100
1 1324266143210735079 50
2 4663892623205934004 70
3 6251873913398988390 60
3.C. MD5Sum, or anything from hashlib
Since the IDs are expected to be small (8 chars), even with MD5, the probability of a collision is very low.
import hashlib
DIGITS = 16 # number of hex digits of the hash to use
df['ID'] = df['ID'].str.encode('utf-8').map(lambda x: int(hashlib.md5(x).hexdigest()[:DIGITS], base=16))
Output:
ID Sales
0 17469287633857111608 100
1 4297816388092454656 50
2 17469287633857111608 70
3 11434864915351595420 60
I'm not very expert in Pandas, which is why I'm implementing a solution for you with NumPy + Pandas. As the solution uses fast NumPy, it will be much faster than a pure Python solution, especially if you have thousands of rows.
Try it online!
import pandas as pd, numpy as np
df = pd.DataFrame([
    ['倀굖곾ꆹ譋῾理', 100],
    ['倀굖곾ꆹ', 50],
    ['倀굖곾ꆹ譋῾理', 70],
    ['곾ꆹ텊躥㫆', 60],
], columns=['ID', 'Sales'])
u, iv = np.unique(df.ID.values, return_inverse=True)
while True:
    ids = np.random.randint(10 ** 7, 10 ** 8, u.size)
    if np.all(np.unique(ids, return_counts=True)[1] <= 1):
        break
df.ID = ids[iv]
print(df)
Output:
ID Sales
0 31043191 100
1 36168634 50
2 31043191 70
3 17162753 60
Given a dataframe df, create a list of the ids:
id_list = list(df.ID)
Then import the required packages
from random import randint
from collections import deque
def idSetToNumber(id_list):
    id_set = deque(set(id_list))
    checked_numbers = []
    while len(id_set) > 0:
        # get a candidate id
        id = randint(10000000, 99999999)
        # check if the id has been used
        if id not in checked_numbers:
            checked_numbers.append(id)
            id_set.popleft()
    return checked_numbers
This gives a list of unique 8-digit numbers, one for each of your unique keys.
Then create a dictionary mapping each unique ID to one of those numbers:
checked_numbers = idSetToNumber(id_list)
unique_ids = list(set(id_list))
name2id = {}
for i in range(len(unique_ids)):
    name2id[unique_ids[i]] = checked_numbers[i]
Last step: replace all the pandas ID fields with the ones in the dictionary.
for i in range(df.shape[0]):
    df.loc[i, 'ID'] = str(name2id[df.loc[i, 'ID']])
I would:
identify the unique ID values
build (from np.random) an array of unique values of the same size
build a transformation dataframe with that array
use merge to replace the original ID values
Possible code:
import numpy as np

trans = df[['ID']].drop_duplicates()  # unique ID values
n = len(trans)
# np.random.seed(0)  # uncomment for reproducible pseudo random sequences
while True:
    # build a larger array to have a higher chance of getting enough unique values
    arr = np.unique(np.random.randint(10000000, 100000000, n + n // 2))
    if len(arr) >= n:
        arr = arr[:n]  # ok, keep only the required number
        break
trans['new'] = arr  # ok, we have our transformation table
df['ID'] = df.merge(trans, how='left', on='ID')['new']  # done...
With your sample data (and with np.random.seed(0)), it gives:
ID Sales
0 12215104 100
1 48712131 50
2 12215104 70
3 70969723 60
Per @Arty's comment, np.unique will return an ascending sequence. If you do not want that, shuffle it before using it for the transformation table:
...
np.random.shuffle(arr)
trans['new'] = arr
...
I would like some help weeding out bits of data within some very long strings in only one column. Within these long strings are pieces of data separated by spaces. However, some extra characters (on certain rows/indexes) appear between the needed pieces of information. I have put together a sample dataframe to show what is happening.
What is happening: the last line of code cycles through the series, separates the first value (from the beginning of the string to the first space occurrence), and moves that value to new_col. It captures '1' in some rows and random characters in other rows.
What I would like to have happen: rows that begin with '1' are left alone; rows that begin with anything other than '1' have that leading value moved to the new column. This will allow me to realign my data and correct spaces until the next hiccup. Hopefully, I can take away what I learn here and press forward.
What I have tried to do: I have tried to break down the problem with a for loop that, when the first position is not a '1', goes to the else statement and calls the line that splits the data at the first space. I receive a "ValueError: Columns must be same length as key" at the line after the else statement. I'm assuming that I'm trying to assign to the dataframe column while the for loop is working from the variable "line". Just not sure.
import pandas as pd
from pandas import DataFrame, Series
file = {'Test': ['1 RTWXA LC WTHML',
                 '2TE7 1 RHRRA LY HGHRL',
                 '1 WJKFD LF LPKRF',
                 'ATEGR/5 1 WJFTA LC TRGEH',
                 '1 POKJD LD PLERT',
                 '4ARF/3 1 EDFRT TG IUYHT']}
df = pd.DataFrame(file)
# ----Questionable Code Starts-----
df[['new_col','Test']] = df["Test"].str.extract(r"([^\s]*)\s?(1.*)")
# ----Questionable Code Ends-------
# below line will split (using spaces) at one occurrence only.
# creates new column (misc_col) and places data within keeping row/index
# integrity.
# df[['new_col','Test']] = df.Test.str.split(" ", 1, expand=True)
print(df)
What's Happening:
   Test              new_col
0  RTWXA LC WTHML    1
1  1 RHRRA LY HGHRL  2TE7
2  WJKFD LF LPKRF    1
3  1 WJFTA LC TRGEH  ATEGR/5
4  POKJD LD PLERT    1
5  1 EDFRT TG IUYHT  4ARF/3
What I would like:
   Test              new_col
0  1 RTWXA LC WTHML
1  1 RHRRA LY HGHRL  2TE7
2  1 WJKFD LF LPKRF
3  1 WJFTA LC TRGEH  ATEGR/5
4  1 POKJD LD PLERT
5  1 EDFRT TG IUYHT  4ARF/3
As always, I appreciate any help you can provide to get me over this hump.
Thanks
You can use pd.Series.str.extract:
print (df["Test"].str.extract(r"([^\s]*)\s?(1.*)"))
0 1
0 1 RTWXA LC WTHML
1 2TE7 1 RHRRA LY HGHRL
2 1 WJKFD LF LPKRF
3 ATEGR/5 1 WJFTA LC TRGEH
4 1 POKJD LD PLERT
5 4ARF/3 1 EDFRT TG IUYHT
I've got a dataframe like the one below, where column c01 represents the start time and c04 the end of time intervals:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements in the list fall in the intervals shown by the dataframe. I coded it like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i : i in [row['c01'], row['c04']])
The file runs as a whole, but 'count' turns out to be 0 every time. Would you please tell me where I messed up?
I took the liberty of converting your list into a numpy array (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
import numpy as np
arr = np.array(gtls)  # your list from the question, as a numpy array
def get_count(row):  # the logic for your summation is here
    return np.sum((row['c01'] < arr) & (row['c04'] >= arr))
df['C_sum'] = df.apply(get_count, axis=1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using lambda:
df['C_sum'] = df.apply(lambda row: np.sum([(row['c01'] < arr) & (row['c04'] >= arr)]), axis = 1)
Welcome to Stack Overflow! The i in [row['c01'], row['c04']] doesn't do what you seem to think; it stands for checking whether element i can be found from the two-element list instead of checking the range between row['c01'] and row['c04']. For checking if a floating point number is within a range, use row['c01'] < i < row['c04'].