I have a data frame that has these values in one of its columns:
In:
df.line.unique()
Out:
array(['Line71A', 'Line71B', 'Line75B', 'Line79A', 'Line79B', 'Line75A', 'Line74A', 'Line74B',
'Line70A', 'Line70B', 'Line58B', 'Line70', 'Line71', 'Line74', 'Line75', 'Line79', 'Line58'],
dtype=object)
I would like to create a new column, box_type, that takes one of two values depending on whether the string contains certain LineXX values. In pseudocode:
if df.line.str.contains("Line70") or df.line.str.contains("Line71") or df.line.str.contains("Line79"):
    return 1
else:
    return 0
So box_type should be 1 if the value in df.line contains "Line70", "Line71", or "Line79", and 0 otherwise.
I tried doing this with this code:
df['box_type'] = df.line.apply(lambda x: 1 if x.contains('Line70') or x.contains('Line71') or x.contains('Line79') else 0)
But I get this error:
AttributeError: 'str' object has no attribute 'contains'
And I tried adding .str in between x and contains, like x.str.contains(), but that also gave an error.
How can I do this?
Thanks!
How about:
df['box_type'] = df.line.str.contains('70|71|79')
Sample data (assuming a is the array of unique values shown in the question):
np.random.seed(1)
df = pd.DataFrame({'line':np.random.choice(a, 10)})
Output:
line box_type
0 Line75A False
1 Line70 True
2 Line71 True
3 Line70A True
4 Line70B True
5 Line70 True
6 Line75A False
7 Line79 True
8 Line71A True
9 Line58 False
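Note that str.contains returns booleans, while the question asks for 1 and 0. The following is a minimal sketch, on a small made-up frame, of casting the result to integers. It also shows why the original apply call failed: inside apply, x is a plain Python str, which has no contains method, so the in operator is used instead. The names targets and box_type_apply are illustrative only.

```python
import pandas as pd

# Small made-up sample; the real data has more values.
df = pd.DataFrame({"line": ["Line71A", "Line75B", "Line79", "Line58", "Line70B"]})

# Vectorized: regex alternation, then cast the booleans to 1/0.
df["box_type"] = df["line"].str.contains("Line70|Line71|Line79").astype(int)

# Element-wise equivalent: x is a plain str here, so test membership with `in`.
targets = ("Line70", "Line71", "Line79")
df["box_type_apply"] = df["line"].apply(lambda x: int(any(t in x for t in targets)))

print(df)
```

The vectorized version is usually preferable on large frames; the apply version is shown only to mirror the original attempt.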
I'm trying to run this piece of code:
df['ID'] =df.groupby(["Code","Number"]).apply(lambda x: x['O'].isin(x['D']) | x['D'].isin(x['O']) & (x['O'] != x['D'])).values
with the following input:
data1 ={"Code":["A","A","A"], "Number":[7,7,7],"O":["BR","AC","BR"],"D":["AC","LF","LF"]}
df=pd.DataFrame(data1)
I get the following error when the input data frame has only one group (on Code and Number):
data = array([[ True, True, False]])
index = Int64Index([0, 1, 2], dtype='int64')
ValueError: Length of values (1) does not match length of index (3)
If I use another input with multiple rows and groups, I don't get any errors. I don't understand what the problem is or how to fix it.
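You get the error because there is a single group: in that case apply returns a one-row DataFrame (one column per original row) instead of a Series, so .values has shape (1, 3) rather than (3,).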
Example (using a function for clarity):
def f(x):
    out = x['O'].isin(x['D']) | x['D'].isin(x['O']) & (x['O'] != x['D'])
    # print(out) # uncomment to see how the groups are handled
    return out
data1 ={"Code":["A","A","A"], "Number":[7,7,7],
"O":["BR","AC","BR"],"D":["AC","LF","LF"]}
df1 = pd.DataFrame(data1)
df1.groupby(["Code","Number"]).apply(f)
0 1 2
Code Number
A 7 True True False
Now let's add another group:
data2 = {"Code":list('AAABBB'), "Number":[7,7,7,8,8,8],
"O":["BR","AC","BR","BR","AC","BR"],"D":["AC","LF","LF","BR","AC","LF"]}
df2 = pd.DataFrame(data2)
df2.groupby(["Code","Number"]).apply(f)
Code Number
A 7 0 True
1 True
2 False
B 8 3 True
4 True
5 True
dtype: bool
You can "fix" the first output with stack:
df1.groupby(["Code","Number"]).apply(f).stack()
Code Number
A 7 0 True
1 True
2 False
dtype: bool
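An alternative that does not depend on how apply shapes its result is to compute one boolean Series per group and let pandas align them on the row labels. This is only a sketch under the assumption that the rule is the one in f above; the variable name parts is illustrative.

```python
import pandas as pd

def f(x):
    # Same rule as above; note that & binds tighter than |.
    return x["O"].isin(x["D"]) | (x["D"].isin(x["O"]) & (x["O"] != x["D"]))

df = pd.DataFrame({"Code": ["A", "A", "A"], "Number": [7, 7, 7],
                   "O": ["BR", "AC", "BR"], "D": ["AC", "LF", "LF"]})

# One boolean Series per group, each indexed by the original row labels.
parts = [f(group) for _, group in df.groupby(["Code", "Number"])]

# Column assignment aligns on the index, so the number of groups
# and the order of the rows do not matter.
df["ID"] = pd.concat(parts)

print(df)
```

Because the assignment aligns on the index rather than on position, the same code should behave identically with one group or with many.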
From what I can see, this line of code returns the following output:
df.groupby(["Code","Number"]).apply(lambda x: x['O'].isin(x['D']) | x['D'].isin(x['O']) & (x['O'] != x['D'])).values
[[ True  True False]]
This array has shape (1, 3): a 2-D array with a single row. When you run df['Id'] = your_code, pandas raises the error because it expects one value per row of df (length 3) but receives a sequence of length 1. So all you need to do is convert the result to a NumPy array and reshape it to one dimension, like this:
Id = df.groupby(["Code","Number"]).apply(lambda x: x['O'].isin(x['D']) | x['D'].isin(x['O']) & (x['O'] != x['D'])).values
df['Id'] = np.reshape(np.array(Id), -1)  # flatten (1, 3) to (3,) without hard-coding the row count
I am not sure this will run on your full dataset, but at least it runs when you have a single group.
I want to extract the first two symbols of a string when its first three symbols match a certain pattern: each of the first two symbols must be one of [ptkbdgG_fvsSxzZhmnNJlrwj], and the third symbol must be one of [IEAOYye|aouKLM#)3*<!(#0~q^LMOEK].
The first two lines work correctly.
The last lines do not work and I do not understand why. The code doesn't raise any errors; it just does nothing.
# extract the first three symbols and save them in a new column
df['first_three_symbols'] = df['ITEM'].str[0:3]
# create a boolean column: do the first three symbols match the pattern?
df["ccv"] = df["first_three_symbols"].str.contains('[ptkbdgG_fvsSxzZhmnNJlrwj][ptkbdgG_fvsSxzZhmnNJlrwj][IEAOYye|aouKLM#)3*<!(#0~q^LMOEK]')
# create another column for the True values in the previous column
if df["ccv"].item == True:
    df['first_two_symbols'] = df["ITEM"].str[0:2]
Here is my output:
ID ITEM FREQ first_three_symbols ccv
0 0 a 563 a False
1 1 OlrMndmEn 1 Olr False
2 2 OlrMndSpOrtl#r 0 Olr False
3 3 AG#l 74 AG# False
4 4 AG#lbMm 24 AG# False
... ... ... ... ... ...
51723 51723 zytzWt# 8 zyt False
51724 51724 zytzytOst 0 zyt False
51725 51725 zYxtIx 5 zYx False
51726 51726 zYxtIxkWt 0 zYx False
51727 51727 zyZe 4 zyZ False
[51728 rows x 5 columns]
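The if statement does nothing because df["ccv"].item is a method object, not a value, so comparing it with True is always False. You can either write a function and use the apply method:
def f(row):
    if row["ccv"]:
        return row["ITEM"][0:2]  # row["ITEM"] is a plain str here, so slice it directly
    else:
        return None
df['first_two_symbols'] = df.apply(f, axis=1)
or you can use the np.where function from the numpy package.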
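The np.where route could look roughly like the sketch below. The sample ITEM values are made up, the names onset and third are illustrative, and str.match is used here (it anchors the pattern at the start of the string) instead of slicing the first three symbols first; treat it as one possible arrangement rather than the only one.

```python
import numpy as np
import pandas as pd

# Made-up sample items; the real column has about 51,000 rows.
df = pd.DataFrame({"ITEM": ["pta", "Olr", "kl#x", "a"]})

# Character classes copied from the question.
onset = r"[ptkbdgG_fvsSxzZhmnNJlrwj]"
third = r"[IEAOYye|aouKLM#)3*<!(#0~q^LMOEK]"

# str.match only matches at the beginning of each string.
df["ccv"] = df["ITEM"].str.match(onset + onset + third)

# Keep the first two symbols where ccv is True, otherwise leave the cell empty.
df["first_two_symbols"] = np.where(df["ccv"], df["ITEM"].str[:2], None)

print(df)
```

Unlike the if statement, np.where evaluates the condition row by row, so each row gets its own result.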
I have two csv files, and I load that data into two different data frames:
CGMData_DF = pd.read_csv("./CGMData.csv", index_col=False, usecols=["Index","Date", "Time", "Sensor Glucose (mg/dL)"])
InsulinData_DF = pd.read_csv("./InsulinData.csv", index_col=False, usecols=["Index","Date","Time","Alarm"])
I want to check for duplicates in the column named "Index" in both data frames.
For the CGM data I do this:
duplicateCGMindexes = CGMData_DF.duplicated(subset=["Index"])
duplicateCGMindexes[duplicateCGMindexes == True]
Jupyter notebook returns:
24628 True
24629 True
24630 True
24631 True
24632 True
...
47545 True
47546 True
47547 True
47548 True
47549 True
Length: 22838, dtype: bool
I take the first value and check using this:
CGMData_DF.loc[CGMData_DF['Index'] == 24628]
and sure enough, Jupyter notebook tells me that there are two rows:
Index Date Time Sensor Glucose (mg/dL)
4273 24628 1/28/2018 16:17:34 NaN
27111 24628 10/31/2017 21:08:59 261.0
I repeat the same process for the Insulin data frame
duplicateInsulinIndexes = InsulinData_DF.duplicated(subset=["Index"])
duplicateInsulinIndexes[duplicateInsulinIndexes == True]
Jupyter notebook returns:
19295 True
19296 True
19297 True
19298 True
19299 True
...
38585 True
38586 True
38587 True
38588 True
38589 True
Length: 19295, dtype: bool
I take the first value and check using this:
InsulinData_DF.loc[InsulinData_DF['Index'] == 19295]
Jupyter notebook returns:
Index Date Time Alarm
38590 19295 8/15/2017 22:24:13 NaN
Upon inspection I realize that 19295 is not the column value that is duplicated, but the row label of the row holding the duplicated column value.
I get the "Index" column value for row label 19295:
InsulinData_DF.loc[19295]
Jupyter returns:
Index 0
Date 11/9/2017
Time 12:23:04
Alarm NaN
Name: 19295, dtype: object
I check for "Index" column value 0:
InsulinData_DF.loc[InsulinData_DF['Index'] == 0]
and Jupyter returns two rows:
Index Date Time Alarm
0 0 2/12/2018 13:20:53 NaN
19295 0 11/9/2017 12:23:04 NaN
My question is: why did the .duplicated() function return column values in one case and row labels in the other?
When you remove the duplicates, remember to assign the result back to a variable:
uniqueDF = InsulinData_DF[~duplicateInsulinIndexes]
OK, I realized that
duplicateCGMindexes = CGMData_DF.duplicated(subset=["Index"])
duplicateCGMindexes[duplicateCGMindexes == True]
was in fact returning row labels only; I mistook them for column values.
CGMData_DF.loc[24628]
gave:
Index 22145
Date 11/9/2017
Time 12:24:33
Sensor Glucose (mg/dL) 117
Name: 24628, dtype: object
and
CGMData_DF.loc[CGMData_DF['Index'] == 22145]
gave:
Index Date Time Sensor Glucose (mg/dL)
1790 22145 2/6/2018 7:25:54 198.0
24628 22145 11/9/2017 12:24:33 117.0
In fact, column value 22145 was the first value at which the "Index" values start repeating.
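To make the distinction concrete, here is a small sketch on made-up data. duplicated() returns a boolean Series that shares the row labels of the frame, so filtering it yields row labels, not "Index" column values. The names mask, dup_labels, dup_values and all_copies are illustrative.

```python
import pandas as pd

# Made-up frame: "Index" values 0 and 1 each appear twice.
df = pd.DataFrame({"Index": [0, 1, 2, 0, 1],
                   "Alarm": ["a", "b", "c", "d", "e"]})

# Boolean Series aligned with the row labels of df;
# True marks the second and later occurrences of a value.
mask = df.duplicated(subset=["Index"])

dup_labels = mask[mask].index.tolist()      # row labels of the repeated rows
dup_values = df.loc[mask, "Index"].tolist() # the "Index" values that repeat

# keep=False flags every copy, including the first occurrence.
all_copies = df[df.duplicated(subset=["Index"], keep=False)]

print(dup_labels, dup_values)
print(all_copies)
```

Looking up df.loc[mask, "Index"] rather than reading the labels off the filtered mask avoids confusing the two.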
I have a pandas dataframe with approximately 3 million rows.
I want to partially aggregate the last column, in separate groups determined by other variables.
My solution was to split the dataframe rows into a list of new dataframes based on those variables, aggregate each dataframe, and then join them again into a single dataframe. The problem is that after a few tens of thousands of rows, I get a memory error. What methods can I use to improve the efficiency of my function and prevent these memory errors?
An example of my code is below:
test = pd.DataFrame({"unneeded_var": [6,6,6,4,2,6,9,2,3,3,1,4,1,5,9],
"year": [0,0,0,0,1,1,1,2,2,2,2,3,3,3,3],
"month" : [0,0,0,0,1,1,1,2,2,2,3,3,3,4,4],
"day" : [0,0,0,1,1,1,2,2,2,2,3,3,4,4,5],
"day_count" : [7,4,3,2,1,5,4,2,3,2,5,3,2,1,3]})
test = test[["year", "month", "day", "day_count"]]
def agg_multiple(df, labels, aggvar, repl=None):
    if(repl is None): repl = aggvar
    conds = df.duplicated(labels).tolist() #returns boolean list of false for a unique (year,month) then true until next unique pair
    groups = []
    start = 0
    for i in range(len(conds)): #When false, split previous to new df, aggregate count
        bul = conds[i]
        if(i == len(conds) - 1): i +=1 #no false marking end of last group, special case
        if not bul and i > 0 or bul and i == len(conds):
            sample = df.iloc[start:i , :]
            start = i
            sample = sample.groupby(labels, as_index=False).agg({aggvar:sum}).rename(columns={aggvar : repl})
            groups.append(sample)
    df = pd.concat(groups).reset_index(drop=True) #combine aggregated dfs into new df
    return df
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
I suppose that I could potentially apply the function to small samples of the dataframe, to prevent a memory error and then combine those, but I'd rather improve the computation time of my function.
This function does the same, and is 10 times faster.
test.groupby(["year", "month"], as_index=False).agg({"day_count":sum}).rename(columns={"day_count":"month_count"})
pandas almost always has an optimized built-in method that will vastly outperform iterating through the dataframe. If I understand correctly, in your case the following returns exactly the same output as your function:
test2 = (test.groupby(['year', 'month'])
.day_count.sum()
.to_frame('month_count')
.reset_index())
>>> test2
year month month_count
0 0 0 16
1 1 1 10
2 2 2 7
3 2 3 5
4 3 3 5
5 3 4 4
To check that it's the same:
# Your original function:
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
>>> test == test2
year month month_count
0 True True True
1 True True True
2 True True True
3 True True True
4 True True True
5 True True True
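As a further variation, named aggregation can do the sum and the rename in one step. This is a sketch on the same sample data, not a benchmark; the variable names monthly and alt are illustrative.

```python
import pandas as pd

test = pd.DataFrame({
    "year":      [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "month":     [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "day_count": [7, 4, 3, 2, 1, 5, 4, 2, 3, 2, 5, 3, 2, 1, 3],
})

# Named aggregation: output_column=(input_column, aggregation).
monthly = (test.groupby(["year", "month"], as_index=False)
               .agg(month_count=("day_count", "sum")))

# The sum-then-rename route, for comparison.
alt = (test.groupby(["year", "month"])["day_count"].sum()
           .to_frame("month_count")
           .reset_index())

print(monthly)
print(monthly.equals(alt))
```

DataFrame.equals returns a single boolean, which is easier to check than the element-wise frame produced by ==.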
I'm new to pandas and enchant. I want to check the spelling of short sentences using Python.
I have a pandas data frame:
id_num word
1 live haapy
2 know more
3 ssweam good
4 eeat little
5 dream alot
And I want to get the following table, with a “check” column:
id_num word check
1 live haapy True, False
2 know more True, True
3 ssweam good False, True
4 eeat little False, True
5 dream alot True, False
What is the best way to do this?
I tried this code:
import enchant
dic = enchant.Dict("ru_Eng")
df['list_word'] = df['word'].str.split() #Get list of all words in each sentence using split()
rows = list()
for row in df[['id_num', 'list_word']].iterrows():
    r = row[1]
    for word in r.list_word:
        rows.append((r.id_num, word))
df2 = pd.DataFrame(rows, columns=['id_num', 'word']) #Make the table with id_num column and a column of separate words
Then I got a new data frame (df2):
id_num word
1 live
1 haapy
2 know
2 more
3 ssweam
3 good
4 eeat
4 little
5 dream
5 alot
After that I check words using:
column = df2['word']
for i in column:
    n = dic.check(i)
    print(n)
The result is:
True
False
True
True
False
True
False
True
True
False
The check is carried out correctly, but when I tried to put the result into a new pandas data frame column, I got False for every word.
for i in column:
    df2['res'] = dic.check(i)
Resulting data frame:
id_num word res
1 live False
1 haapy False
2 know False
2 more False
3 ssweam False
3 good False
4 eeat False
4 little False
5 dream False
5 alot False
I will be grateful for any help!
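The loop above assigns a single scalar to the whole res column on every iteration, so the column ends up holding only the result for the last word (alot, which is False). A minimal sketch of a per-row assignment follows. So that it runs without an installed dictionary, it uses a hypothetical stand-in function check backed by a small made-up word set; with enchant, dic.check would take its place.

```python
import pandas as pd

# Hypothetical stand-in for dic.check, so the sketch runs without enchant.
KNOWN_WORDS = {"live", "know", "more", "good", "little", "dream"}

def check(word):
    return word in KNOWN_WORDS

df = pd.DataFrame({"id_num": [1, 2, 3, 4, 5],
                   "word": ["live haapy", "know more", "ssweam good",
                            "eeat little", "dream alot"]})

# One result per word, joined back into one string per sentence.
df["check"] = df["word"].apply(
    lambda s: ", ".join(str(check(w)) for w in s.split()))

# Long format: one row per word, each with its own result.
df2 = (df[["id_num", "word"]]
       .assign(word=df["word"].str.split())
       .explode("word")
       .reset_index(drop=True))
df2["res"] = df2["word"].apply(check)

print(df)
print(df2)
```

Using apply gives each row its own value, whereas df2['res'] = scalar broadcasts one value to every row.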