I have unique values in a column, but they all have strange codes, and I want to instead have a numeric counter to identify these values. Is there a better way to do this?
class umm:
    inc = 0
    last_val = ''

    @classmethod
    def create_new_index(cls, new_val):
        # bump the counter whenever the value differs from the previous row
        if new_val != cls.last_val:
            cls.inc += 1
            cls.last_val = new_val
        return cls.inc

df['Doc_ID_index'] = df['Doc_ID'].apply(umm.create_new_index)
Here is the dataframe:
Doc_ID Sent_ID Doc_ID_index
0 PMC2774701 S1.1 1
1 PMC2774701 S1.2 1
2 PMC2774701 S1.3 1
3 PMC2774701 S1.4 1
4 PMC2774701 S1.5 1
... ... ... ...
46019 3469-0 3469-51 6279
46020 3528-0 3528-10 6280
46021 3942-0 3942-39 6281
46022 4384-0 4384-25 6282
46023 4622-0 4622-45 6283
Method 1
# take the unique Doc IDs in the column
new_df = pd.DataFrame({'Doc_ID': df['Doc_ID'].unique()})
# assign a unique id, starting at 1
new_df['Doc_ID_index'] = new_df.index + 1
# merge back into the original df to get the whole df
df = pd.merge(df, new_df, on='Doc_ID')
Method 2
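# ngroup() is 0-based and sorts the keys by default; sort=False numbers groups in order of appearance
df['Doc_ID_index'] = df.groupby('Doc_ID', sort=False).ngroup() + 1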
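A third option, as a minimal sketch: pd.factorize numbers the values in order of first appearance. The small frame below is made-up sample data, not the asker's. Note one difference from the class-based counter in the question: a Doc_ID that reappears later gets its original number back here, whereas the counter would hand out a new one.

```python
import pandas as pd

# Made-up sample data; only the Doc_ID column matters here.
df = pd.DataFrame({"Doc_ID": ["PMC2774701", "PMC2774701", "3469-0", "3528-0", "3469-0"]})

# factorize returns 0-based integer codes in order of first appearance
codes, uniques = pd.factorize(df["Doc_ID"])
df["Doc_ID_index"] = codes + 1  # shift so the counter starts at 1

print(df)
```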
I hope this helps!
Related
I have a pandas dataframe called week5_233C. My Python version is 3.19.13.
I wrote an if-statement to add a new column to my data set: Spike. If the value in Value [pV] is not equal to 0, I want to add a 1 to that row. If Value [pV] is 0, then I want to add in the spike column that it is 0.
The data looks like this:
TimeStamp [µs] Value [pV]
0 1906200 0
1 1906300 0
2 1906400 0
3 1906500 -149012
4 1906600 -149012
And I want it to look like this:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
I tried:
week5_233C.loc[week5_233C[' Value [pV]'] != 0, 'Spike'] = 1
week5_233C.loc[week5_233C[' Value [pV]'] == 0, 'Spike'] = 0
but all rows in column Spike get the same value.
I also tried:
week5_233C['Spike'] = week5_233C[' Value [pV]'].apply(lambda x: 0 if x == 0 else 1)
Again, it just adds only 0s or only 1s, but does not work with if and else. See example data:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 1
1 1906300 0 1
2 1906400 0 1
3 1906500 -149012 1
4 1906600 -149012 1
Doing it like this:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C['Spike'] = 1
    elif i == 0:
        week5_233C['Spike'] = 0
does not do anything: does not add a column, does not give an error, and makes Python crash.
However, when I run this if-statement with just a print as such:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        print(1)
    elif i == 0:
        print(0)
then it does print the exact values I want. I cannot figure out how to save these values in a new column.
This:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C.concat([1, df.iloc['Spike']])
    elif i == 0:
        week5_233C.concat([0, df.iloc['Spike']])
gives me an error: AttributeError: 'DataFrame' object has no attribute 'concat'
How can I make a new column Spike and add the values 0 and 1 based on the value in column Value [pV]?
I think you should check the dtype of the Value [pV] column. It probably holds strings, which is why every row gets the same value. Try print(df['Value [pV]'].dtype). If it prints object, convert the column with astype(float) or pd.to_numeric(df['Value [pV]']).
You can also try:
df['spike'] = np.where(df['Value [pV]'] == '0', 0, 1)
Update
To show the bad rows and debug your dataframe, use the following code:
df.loc[pd.to_numeric(df['Value [pV]'], errors='coerce').isna(), 'Value [pV]']
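Putting those checks together, here is a minimal sketch. It assumes (the question does not confirm this) that the values were read in as strings and that the column name carries a leading space, as in the asker's code. The sample frame is rebuilt by hand from the question.

```python
import pandas as pd

# Sample data rebuilt from the question; string values and the leading
# space in the column name are assumptions about the real file.
df = pd.DataFrame({
    "TimeStamp [µs]": [1906200, 1906300, 1906400, 1906500, 1906600],
    " Value [pV]": ["0", "0", "0", "-149012", "-149012"],
})

# Normalise the headers so ' Value [pV]' and 'Value [pV]' are the same column
df.columns = df.columns.str.strip()

# Convert to numbers; anything unparseable becomes NaN
values = pd.to_numeric(df["Value [pV]"], errors="coerce")

# 1 where the value is non-zero, 0 otherwise (NaN also counts as non-zero here)
df["Spike"] = values.ne(0).astype(int)

print(df)
```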
import pandas as pd
df = pd.DataFrame({'TimeStamp [µs]': [1906200, 1906300, 1906400, 1906500, 1906600],
                   'Value [pV] ': [0, 0, 0, -149012, -149012],
                   })
df['Spike'] = df.agg({'Value [pV] ': lambda v: int(bool(v))})
print(df)
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
I have a DF like the below:
I want to groupby price, count the number of occurrences where action == N / U / D for each price.
ID,Action,indicator,side, price, quantity
7930249,U,0,A,132.938,23
7930251,D,0,B,132.906,2
7930251,N,1,B,132.891,36
7930251,U,0,A,132.938,22
7930252,U,0,A,132.938,2
7930252,U,1,A,132.953,39
7930252,U,2,A,132.969,17
7930514,U,0,B,132.906,1
7930514,U,0,A,132.938,8
7930514,U,1,A,132.953,38
7930515,U,0,A,132.938,18
7930515,U,2,A,132.969,7
7930516,U,1,B,132.891,37
7930516,U,0,A,132.938,28
Current code:
pricelist = []
column_names = ['Price', 'N', 'U', 'D']
df_counter = pd.DataFrame(columns = column_names)
for name, group in df.groupby('Price'):
    price = name
    if price not in pricelist:
        pricelist.append(price)
        n_count = group['Action'][group['Action']=='N'].count()
        u_count = group['Action'][group['Action']=='U'].count()
        d_count = group['Action'][group['Action']=='D'].count()
        dflist = [price, n_count, u_count, d_count]
        price_dict = {'Price':price,'N':n_count, 'U':u_count,'D':d_count}
        df1 = pd.DataFrame([price_dict], columns=price_dict.keys())
        result = df_counter.append(df1)
        continue
    else:
        continue
Returning:
Price N U D
0 136.938 1 0 0
Why is it not creating a longer data frame? I basically get the result I want if I print out price_dict, but I am struggling to save it to a DataFrame.
IIUC, try using pd.crosstab, instead of coding your own method:
pd.crosstab(df['price'], df['Action'])
Output:
Action D N U
price
132.891 0 1 1
132.906 1 0 1
132.938 0 0 6
132.953 0 0 2
132.969 0 0 2
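As for why the loop returns a single row: df_counter.append(df1) returns a new frame and leaves df_counter untouched, so result is overwritten on every pass and ends up holding only the last price. If you want the N / U / D column order and a regular Price column from the question, a sketch along these lines may help. The frame is rebuilt by hand from the sample rows, keeping only the two columns that matter.

```python
import pandas as pd

# Action and price columns copied from the sample rows in the question
df = pd.DataFrame({
    "Action": ["U", "D", "N", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U"],
    "price": [132.938, 132.906, 132.891, 132.938, 132.938, 132.953, 132.969,
              132.906, 132.938, 132.953, 132.938, 132.969, 132.891, 132.938],
})

counts = (
    pd.crosstab(df["price"], df["Action"])
    # force the column order and keep a column even if an action never occurs
    .reindex(columns=["N", "U", "D"], fill_value=0)
    .reset_index()
)
counts.columns.name = None  # drop the leftover "Action" label on the header

print(counts)
```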
I have a column "Employees" that contains the following data:
122.12 (Mark/Jen)
32.11 (John/Albert)
29.1 (Jo/Lian)
I need to count how many values match a specific condition (like x>31).
base = list()
count = 0
count2 = 0
for element in data['Employees']:
    base.append(element.split(' ')[0])
    if base > 31:
        count = count + 1
    else:
        count2 = count2 + 1
print(count)
print(count2)
The output should tell me that count is 2 and count2 is 1. The problem is that I cannot compare a float to a list. How can I make that if work?
You have a df with an Employees column that you need to split into number and text. Keep the number, convert it to a float, then filter on a value:
import pandas as pd
df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
                                 "29.1(Jo/Lian)"]})
print(df)
# split at (
df["value"] = df["Employees"].str.split("(")
# convert to float
df["value"] = pd.to_numeric(df["value"].str[0])
print(df)
# filter it into 2 boolean series
smaller = df["value"] < 31
remainder = df["value"] >= 31  # complement of "smaller", so no value is counted twice
print(smaller)
print(remainder)
# counts
smaller31 = sum(smaller) # True == 1 -> sum([True,False,False]) == 1
bigger30 = sum(remainder)
print(f"Smaller: {smaller31} bigger30: {bigger30}")
Output:
# df
Employees
0 122.12 (Mark/Jen)
1 32.11(John/Albert)
2 29.1(Jo/Lian)
# after split/to_numeric
Employees value
0 122.12 (Mark/Jen) 122.12
1 32.11(John/Albert) 32.11
2 29.1(Jo/Lian) 29.10
# smaller
0 False
1 False
2 True
Name: value, dtype: bool
# remainder
0 True
1 True
2 False
Name: value, dtype: bool
# counted
Smaller: 1 bigger30: 2
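If you only need the two counts from the question (x > 31 versus everything else), a shorter sketch is to pull the leading number out with a regular expression. This assumes every entry starts with a plain decimal number. The names count and count2 are taken from the question.

```python
import pandas as pd

data = pd.DataFrame({"Employees": ["122.12 (Mark/Jen)", "32.11 (John/Albert)", "29.1 (Jo/Lian)"]})

# Capture the leading number of each entry and convert it to float
values = (
    data["Employees"]
    .str.extract(r"^\s*([-+]?\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

count = int((values > 31).sum())    # entries above the threshold
count2 = int((values <= 31).sum())  # everything else

print(count)
print(count2)
```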
We have this function:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    df["getfirst"] = np.where(df["USDamt"] > CustomAmt, 1, 0)
    wantedprice = "??"
    print(df)
    print()
    print("Wanted Price:",wantedprice)
    return wantedprice
Calling it using a custom USDamt like this:
GetPricePerCustomAmt(500)
gets this result:
Price USDamt getfirst
0 281.48 104.84 0
1 281.44 5140.77 1
2 281.42 10072.24 1
3 281.39 15773.83 1
4 281.33 19314.54 1
5 281.27 22255.55 1
6 281.20 23427.64 1
7 281.13 23708.77 1
8 281.10 23738.77 1
9 281.08 24019.88 1
10 281.01 25986.95 1
11 281.00 26127.45 1
Wanted Price: ??
We want to return the Price row of the first 1 appearing in the "getfirst" column.
Examples:
GetPricePerCustomAmt(500)
Wanted Price: 281.44
GetPricePerCustomAmt(15000)
Wanted Price: 281.39
GetPricePerCustomAmt(24000)
Wanted Price: 281.08
How do we do it?
(If you know a more efficient way to get the wanted price please do tell too)
Filter with boolean indexing, then use next with iter so that a default value is returned when nothing matches and the filtered Series is empty:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    return next(iter(df.loc[df["USDamt"] > CustomAmt, 'Price']), 'no matched')
print(GetPricePerCustomAmt(500))
281.44
print(GetPricePerCustomAmt(15000))
281.39
print(GetPricePerCustomAmt(24000))
281.08
print(GetPricePerCustomAmt(100000))
no matched
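On the efficiency question: in the sample data USDamt is sorted in ascending order, so a binary search can find the first row above the threshold without building a boolean mask. The sketch below relies on that ordering, which is an assumption about your real data. The function name get_price_per_custom_amt and the default parameter are made up for illustration, and the frame is passed in so it is built only once.

```python
import numpy as np
import pandas as pd

data = [{"Price": 281.48, "USDamt": 104.84}, {"Price": 281.44, "USDamt": 5140.77},
        {"Price": 281.42, "USDamt": 10072.24}, {"Price": 281.39, "USDamt": 15773.83},
        {"Price": 281.33, "USDamt": 19314.54}, {"Price": 281.27, "USDamt": 22255.55},
        {"Price": 281.2, "USDamt": 23427.64}, {"Price": 281.13, "USDamt": 23708.77},
        {"Price": 281.1, "USDamt": 23738.77}, {"Price": 281.08, "USDamt": 24019.88},
        {"Price": 281.01, "USDamt": 25986.95}, {"Price": 281.0, "USDamt": 26127.45}]
df = pd.DataFrame(data)

def get_price_per_custom_amt(df, custom_amt, default=None):
    """Return the Price of the first row whose USDamt exceeds custom_amt.

    Assumes df["USDamt"] is sorted ascending.
    """
    # side="right" yields the first position whose value is strictly greater
    pos = np.searchsorted(df["USDamt"].to_numpy(), custom_amt, side="right")
    if pos == len(df):
        return default  # no row exceeds the threshold
    return df["Price"].iloc[pos]

print(get_price_per_custom_amt(df, 500))
print(get_price_per_custom_amt(df, 15000))
print(get_price_per_custom_amt(df, 24000))
print(get_price_per_custom_amt(df, 100000))
```

If the column is not sorted, the boolean-indexing version above is the safer choice.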
I have to look up each "number_id" in a column of table 1 that contains lists of ids, and create a new column holding the list of matching control numbers from table 2.
I am doing:
import pandas as pd
table_1 = pd.read_excel('path/file.xlsx', sheet_name="sheet 1")
table_2 = pd.read_excel('path/file.xlsx', dtype='str')
table_1[['Number_id_table_1']].head(5)
Number_id_table_1
0 [35904690, 20344131]
1 [26360006]
2 NaN
3 [46780790]
4 [355343]
table_2.head()
control account_id_nk
0 71996761124 10197651
1 49991227097 1263884
2 71981020953 876828
3 11964723845 35661849
4 47992004868 19071134
To compare the values and add the control number I am doing:
from itertools import chain
def mapping_account_id(index, original_df, column_id_name = str()):
    original_index = index
    list_column_id = []
    if original_index in original_df:
        for ind in original_index:
            list_column_id.append(original_df.iloc[original_index][column_id_name])
        return list(set(list(chain(*list_column_id))))
    else:
        return None

table_1['Number_id_table_1_teste'] = table_1['Number_id_table_1'].apply(mapping_account_id, args = (table_2, 'control'))
The result is "None" for every row, but I know that the values exist in the table.
Number_id_table_1_teste
0 None
1 None
2 None
3 None
4 None
I expected the column "Number_id_table_1_teste" to contain the control number for each number_id.
Number_id_table_1_teste
0 [21964258763, 81999403136]
1 [92993930352]
2 NaN
3 [17996018821]
4 [85988943884]
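One possible direction, as a rough sketch under several assumptions that the question does not confirm: that account_id_nk in table 2 is the column the ids should be matched against, that table 2 holds strings (it is read with dtype='str'), and that the lists in table 1 may arrive from Excel as text such as "[35904690, 20344131]". The check `original_index in original_df` tests the column labels of table 2, not its values, which would explain the None results. The sample frames and the helper name map_ids below are made up for illustration.

```python
import ast
import pandas as pd

# Made-up stand-ins for the two Excel sheets
table_1 = pd.DataFrame({
    "Number_id_table_1": ["[10197651, 1263884]", "[876828]", None, "[35661849]", "[999]"],
})
table_2 = pd.DataFrame({
    "control": ["71996761124", "49991227097", "71981020953", "11964723845", "47992004868"],
    "account_id_nk": ["10197651", "1263884", "876828", "35661849", "19071134"],
})

# Dictionary lookup: account id (as string) -> control number
lookup = table_2.set_index("account_id_nk")["control"].to_dict()

def map_ids(cell):
    """Translate one cell of ids into the list of matching control numbers."""
    if not isinstance(cell, (str, list)):
        return cell  # leave NaN / None untouched
    ids = ast.literal_eval(cell) if isinstance(cell, str) else cell
    # ids without a match in table 2 are skipped
    return [lookup[str(i)] for i in ids if str(i) in lookup]

table_1["Number_id_table_1_teste"] = table_1["Number_id_table_1"].apply(map_ids)
print(table_1)
```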