How to change the iterrows method to apply - python

I have this code, which runs over around 60k rows. It takes around 4 hours to complete the whole process. That is not feasible, so I want to use apply instead of iterrows because of time constraints.
Here is the code:
all_merged_k = pd.DataFrame(columns=all_merged_f.columns)
for index, row in all_merged_f.iterrows():
    if (row['route_count'] == 0):
        all_merged_k = all_merged_k.append(row)
    else:
        for i in range(row['route_count']):
            row1 = row.copy()
            row['Route Number'] = i
            row['Route_Broken'] = row1['routes'][i]
            all_merged_k = all_merged_k.append(row)
Basically, if the route count is 0, the code appends the row unchanged. Otherwise it appends one row per route, with all values the same except for the routes column (which contains a nested list), so each nested route ends up in its own row. The broken-out route and its position go into new columns called Route_Broken and Route Number.
Sample of data:
routes                  route_count
[[CHN-IND]]             1
[[CHN-IND],[IND-KOR]]   2
Output data:
routes                  route_count  Broken_Route  Route Number
[[CHN-IND]]             1            [CHN-IND]     1
[[CHN-IND],[IND-KOR]]   2            [CHN-IND]     1
[[CHN-IND],[IND-KOR]]   2            [IND-KOR]     2
Is this possible using apply? 4 hours is far too long, and this can't be put into production. Please help.
The code below doesn't work:
df.join(df['routes'].explode().rename('Broken_Route')) \
  .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
or
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
It doesn't work when index values match; as the last row shows, Route Number should be 1.

Do you expect something like this?
>>> df.join(df['routes'].explode().rename('Broken_Route')) \
      .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
                   routes  route_count Broken_Route  Route Number
0             [[CHN-IND]]            1    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]             2
2                                    0                          1
Setup:
data = {'routes': [[['CHN-IND']], [['CHN-IND'], ['IND-KOR']], ''],
        'route_count': [1, 2, 0]}
df = pd.DataFrame(data)
Update 1: added a record with route_count=0 and routes=''.

You can assign the routes and counts and explode:
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
NB. Multi-column explode requires pandas ≥ 1.3.0; on older versions, use this method.
Output:
                   routes  route_count Broken_Route count
0             [[CHN-IND]]            1    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]     1
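Both answers number the routes per index label, so they rely on the index being unique. If the original frame has repeated index labels, which may be what the asker ran into, one option is to give every source row its own key before exploding. The following is a minimal sketch under that assumption; the `row_id` column is a hypothetical helper, and the empty-list row stands in for a `route_count` of 0.

```python
import pandas as pd

# Sample data in the shape described in the question; the empty list
# represents a row with route_count == 0.
df = pd.DataFrame({
    "routes": [[["CHN-IND"]], [["CHN-IND"], ["IND-KOR"]], []],
    "route_count": [1, 2, 0],
})

# row_id (hypothetical helper) is unique per source row, even if the
# original index labels are not.
out = df.reset_index(drop=True).rename_axis("row_id").reset_index()

# Copy the nested list and explode one level: one output row per route.
# A row with an empty list is kept, with NaN in Broken_Route.
out["Broken_Route"] = out["routes"]
out = out.explode("Broken_Route", ignore_index=True)

# Number the routes 1..n within each source row.
out["Route Number"] = out.groupby("row_id").cumcount() + 1

print(out.drop(columns="row_id"))
```

Because `explode` works on whole columns, this avoids the repeated `append` inside `iterrows`, which copies the growing frame on every iteration.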

Related

Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner getting familiar with pandas.
It throws an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd
drinks = pd.read_csv('drinks.csv')
def calculate(drinks):
    return drinks['beer_servings']+drinks['spirit_servings']+drinks['wine_servings']
print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate,axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, each row of the DataFrame is passed to it as an argument. Here, the function calculate is returning a DataFrame with multiple columns, but you are trying to assign it to a single column, which is not possible. You can try updating your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']
drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
I suppose the reason is the wrong argument name inside the calculate method. The given argument is drink, but drinks is used to calculate the sum of the columns.
The reason is that drink is a Series object that represents a row, and the sum of its elements is a scalar. Meanwhile, drinks is a DataFrame, and the sum of its columns is a Series object.
The sample code below shows that this method works.
import pandas as pd
df = pd.DataFrame({
    "A":[1,1,1,1,1],
    "B":[2,2,2,2,2],
    "C":[3,3,3,3,3]
})
def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]
df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
A B C total
0 1 2 3 6
1 1 2 3 6
2 1 2 3 6
3 1 2 3 6
4 1 2 3 6
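For a plain row-wise sum like this, apply is not strictly needed. A minimal sketch of a vectorised alternative follows; the sample values and the `serving_cols` name are made up for illustration, since drinks.csv is not shown in the question.

```python
import pandas as pd

# Made-up stand-in for drinks.csv with the three serving columns
# named in the question.
drinks = pd.DataFrame({
    "country": ["A", "B"],
    "beer_servings": [10, 20],
    "spirit_servings": [1, 2],
    "wine_servings": [5, 0],
})

serving_cols = ["beer_servings", "spirit_servings", "wine_servings"]

# Sum across the selected columns for each row; the result is a Series,
# so it can be assigned to a single column.
drinks["total_servings"] = drinks[serving_cols].sum(axis=1)

# Scalar multiplication is also vectorised, so no lambda is required.
drinks["beer_sales"] = drinks["beer_servings"] * 2

print(drinks)
```

Listing the columns explicitly also avoids depending on column order, which the label slice 'beer_servings':'wine_servings' does.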

Reading values from dataframe.iloc is too slow and problem in dataframe.values

I use Python and I have data with 35,000 rows. I need to change values in a loop, but it takes too much time.
PS: I have columns named Succes_1, Succes_2, Succes_5, Succes_7, ..., Succes_120, so I get the name of the column in the inner loop. The values depend on the other column.
Example:
SK_1  SK_2  SK_5  ....  SK_120  Succes_1  Succes_2  ...  Succes_120
1     0     1           0       1         0              0
1     1     0           1       2         1              1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1+i
Is there a faster way to do this? I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it raises an error; maybe it doesn't accept a string index.
You can define columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't you can use sorted with a custom function. String-based sorting will cause '20' to come before '100'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])
cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1
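Another way to express the loop from the question without iterating is a NumPy mask. This is a minimal sketch, assuming each Succes_k column is paired with the SK_k column of the same suffix and that the value to write is the row position plus one, as in the original loop. The small frame and the `ids` list are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up miniature of data_jeux with two SK/Succes column pairs.
data_jeux = pd.DataFrame({
    "SK_1": [1, 1, 0],
    "SK_2": [0, 1, 1],
    "Succes_1": [0, 0, 0],
    "Succes_2": [0, 0, 0],
})

ids = [1, 2]  # hypothetical stand-in for succ_len in the question
sk_cols = [f"SK_{k}" for k in ids]
succ_cols = [f"Succes_{k}" for k in ids]

# True wherever the paired SK column equals 1.
mask = data_jeux[sk_cols].to_numpy() == 1

# Column vector holding 1 + i for every row position i.
row_values = np.arange(1, len(data_jeux) + 1)[:, None]

# Keep the current value where the mask is False, write 1 + i where True.
current = data_jeux[succ_cols].to_numpy()
data_jeux[succ_cols] = np.where(mask, row_values, current)

print(data_jeux)
```

The whole update happens in one array operation, so the cost no longer grows with a Python-level loop over 35,000 rows times the number of columns.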

Counting Values In Columns Ignoring AlphaNumeric Values

First post here. I am trying to find the total count of values in an Excel file. After importing the file, I need to count all the values except 0, and wherever there is a 0, make that cell blank.
> df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the line above, but it is not working properly: it counts the column name as well, and I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are the column names and index values, which need to be ignored when counting.
The output should look like this after counting:
Final 8 9 9 5
If you just need the count, without changing the values in your DataFrame, you could apply a function to each cell with the applymap method. First create a function to check for a float:
def floatcheck(value):
    if isinstance(value, float):
        return 1
    else:
        return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
I was able to solve the issue, so here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What I did was change the starting index so that the first row, which is alphanumeric, is not counted, and then I just compared the values.
Thanks for your response.
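Both answers above rely on DataFrame.append, which was removed in pandas 2.0. A minimal sketch of an alternative follows, assuming the goal is to count cells that are numeric and non-zero. It uses pd.to_numeric to turn the alphanumeric header row into NaN and pd.concat to add the totals row. The frame is a cut-down, made-up version of the demo data.

```python
import pandas as pd

# Cut-down version of the demo frame: the first row holds alphanumeric
# labels, the remaining rows hold floats (some of them zero).
df5 = pd.DataFrame(
    {
        0: ["1007_s", 0.08277, 0.09503, 0.0],
        1: ["1053_a", 0.00874, 0.00592, 0.00678],
    },
    index=["ID_REF", "GSM95473", "GSM95474", "GSM95475"],
)

# Non-numeric cells become NaN instead of raising an error.
numeric = df5.apply(pd.to_numeric, errors="coerce")

# Count cells that are numeric and not equal to zero, per column.
counts = (numeric.notna() & numeric.ne(0)).sum()
counts.name = "Final Value"

# Add the counts as a new bottom row without using append.
df6 = pd.concat([df5, counts.to_frame().T])

print(df6)
```

Unlike the isinstance check, this also handles zeros stored as integers or as numeric strings, since the comparison happens after conversion.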

Convert a string column to number in a dataframe

I'm trying to convert a column in my DataFrame to numbers. The input is email domains extracted from email addresses. Sample:
>>> data['emailDomain']
0 [gmail]
1 [gmail]
2 [gmail]
3 [aol]
4 [yahoo]
5 [yahoo]
I want to create a new column where if the domain is gmail or aol, the column entry would be a 1 and 0 otherwise.
I created a method which goes like this:
def convertToNumber(row):
    try:
        if row['emailDomain'] == '[gmail]':
            return 1
        elif row['emailDomain'] == '[aol]':
            return 1
        elif row['emailDomain'] == '[outlook]':
            return 1
        elif row['emailDomain'] == '[hotmail]':
            return 1
        elif row['emailDomain'] == '[yahoo]':
            return 1
        else:
            return 0
    except TypeError:
        print("TypeError")
and used it like:
data['validEmailDomain'] = data.apply(convertToNumber, axis=1)
However, my output column is 0 even when I know there are gmail and aol emails present in the input column.
Any idea what could be going wrong?
Also, I think this usage of conditional statements might not be the most efficient way to tackle this problem. Is there any other approach to getting this done?
You can use Series.isin:
providers = {'gmail', 'aol', 'yahoo','hotmail', 'outlook'}
data['emailDomain'].isin(providers)
Searching the provider
Instead of applying a regex to each email in each row, you can use the Series.str methods to do it on a whole column at a time:
pattern2 = r'(?<=#)([^.]+)(?=\.)'
df['email'].str.extract(pattern2, expand=False)
So this becomes something like this:
pattern2 = r'(?<=#)([^.]+)(?=\.)'
providers = {'gmail', 'aol', 'yahoo','hotmail', 'outlook'}
df = pd.DataFrame(data={'email': ['test.1#gmail.com', 'test.2#aol.com', 'test3#something.eu']})
provider_serie = df['email'].str.extract(pattern2, expand=False)
0 gmail
1 aol
2 something
Name: email, dtype: object
interested_providers = df['email'].str.extract(pattern2, expand=False).isin(providers)
0 True
1 True
2 False
Name: email, dtype: bool
If you really want 0s and 1s, you can add a .astype(int)
Your code would work if your series contained strings. As it stands, the series likely contains lists, in which case you need to extract the first element.
I would also utilise pd.Series.map instead of using any row-wise logic. Below is a complete example:
df = pd.DataFrame({'emailDomain': [['gmail'], ['gmail'], ['gmail'], ['aol'],
                                   ['yahoo'], ['yahoo'], ['else']]})
domains = {'gmail', 'aol', 'outlook', 'hotmail', 'yahoo'}
df['validEmailDomain'] = df['emailDomain'].map(lambda x: x[0]).isin(domains)\
                                          .astype(int)
print(df)
# emailDomain validEmailDomain
# 0 [gmail] 1
# 1 [gmail] 1
# 2 [gmail] 1
# 3 [aol] 1
# 4 [yahoo] 1
# 5 [yahoo] 1
# 6 [else] 0
You could sum up the occurrence checks for every provider via list comprehensions and write the resulting list into data['validEmailDomain']:
import numpy as np
providers = ['gmail', 'aol', 'outlook', 'hotmail', 'yahoo']
data['validEmailDomain'] = [np.sum([p in e for p in providers]) for e in data['emailDomain'].values]

Find first time a value occurs in the dataframe

I have a dataframe with year-quarter (e.g. 2015-Q4), the customer_ID, and amount booked, and many other columns irrelevant for now. I want to create a column that has the first time each customer made a booking. I tried this:
alldata.sort_values(by=['Total_Apps_Reseller_Bookings_USD', 'Year_Quarter'],
                    ascending=[1, 1],
                    inplace=True)
first_q = alldata[['Customer_ID', 'Year_Quarter']].groupby(by='Customer_ID').first()
but I am not sure it worked.
Also, I then want another column that tells me how many quarters after the first booking each booking was made. I failed using replace and a dictionary, so I used a merge. I create a numeric id for each quarter of booking and for the first quarter from above, and then subtract the two:
q_booking_num = pd.DataFrame({'Year_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_booking_num, on='Year_Quarter', how='outer')
q_first_num = pd.DataFrame({'First_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_first_num, on='First_Quarter', how='outer')
This doesn't seem to have worked at all, as I see 'first quarters' that come after some bookings that were already made.
You need to specify which column to use for taking the first value:
first_q = (alldata[['Customer_ID','Year_Quarter']]
           .groupby(by='Customer_ID')
           .Year_Quarter
           .first()
           )
Here is some sample data for three customers:
df = pd.DataFrame({'customer_ID': [1,
                                   2, 2,
                                   3, 3, 3],
                   'Year_Quarter': ['2010-Q1',
                                    '2010-Q1', '2011-Q1',
                                    '2010-Q1', '2011-Q1', '2012-Q1'],
                   'Total_Apps_Reseller_Bookings_USD': [1,
                                                        2, 3,
                                                        4, 5, 6]})
Below, I convert text quarters (e.g. '2010-Q1') to a numeric equivalent by taking the int value of the first four characters (df.Year_Quarter.str[:4].astype(int)). I then multiply it by four and add the value of the quarter. This value is only used for differencing to determine the total number of quarters since the first order.
Next, I use transform on the groupby to take the min value of these quarters we just calculated. Using transform keeps this value in the same shape as the original dataframe.
I then calculate quarters_since_first_order as the difference between the quarter and the first quarter.
df['quarters'] = df.Year_Quarter.str[:4].astype(int) * 4 + df.Year_Quarter.str[-1].astype(int)
first_order_quarter_no = df.groupby('customer_ID').quarters.transform('min')
df['quarters_since_first_order'] = df['quarters'] - first_order_quarter_no
del df['quarters']  # Clean-up.
>>> df
Total_Apps_Reseller_Bookings_USD Year_Quarter customer_ID quarters_since_first_order
0 1 2010-Q1 1 0
1 2 2010-Q1 2 0
2 3 2011-Q1 2 4
3 4 2010-Q1 3 0
4 5 2011-Q1 3 4
5 6 2012-Q1 3 8
For part 1:
I think you need to sort a little differently to get your desired outcome:
alldata.sort_values(by=['Customer_ID', 'Year_Quarter',
                        'Total_Apps_Reseller_Bookings_USD'],
                    ascending=[True, True, True], inplace=True)
first_q = alldata[['Customer_ID','Year_Quarter']].groupby(by='Customer_ID').head(1)
For part 2:
Continuing off of part 1, you can merge the values back on to the original dataframe. At that point, you can write a custom function to subtract your date strings and then apply it to each row.
Something like:
def qt_sub(val, first):
    # Quarter strings look like '2015-Q4': year in [0:4], quarter digit at index 6.
    year_dif = int(val[0:4]) - int(first[0:4])
    qt_dif = int(val[6]) - int(first[6])
    return 4 * year_dif + qt_dif
alldata['diff_from_first'] = alldata.apply(lambda x: qt_sub(x['Year_Quarter'],
                                                            x['First_Sale']),
                                           axis=1)
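The merge step described in part 2 is not shown above. A minimal sketch of one way to do it follows, assuming quarter strings always have the form 'YYYY-Qn' (so they sort chronologically as text) and reusing the First_Sale column name from the snippet above. The sample values and the `quarter_index` helper are made up for illustration, and the row-wise apply is swapped for vectorised string slicing.

```python
import pandas as pd

# Made-up sample: customer 2 has two bookings, listed out of order.
alldata = pd.DataFrame({
    "Customer_ID": [1, 2, 2],
    "Year_Quarter": ["2010-Q1", "2011-Q3", "2010-Q4"],
})

# Earliest quarter per customer; 'YYYY-Qn' strings compare correctly as text.
first_q = (
    alldata.groupby("Customer_ID", as_index=False)["Year_Quarter"]
    .min()
    .rename(columns={"Year_Quarter": "First_Sale"})
)

# Merge the first quarter back onto every booking of that customer.
alldata = alldata.merge(first_q, on="Customer_ID", how="left")


def quarter_index(quarters: pd.Series) -> pd.Series:
    """Map 'YYYY-Qn' strings to a running quarter number (hypothetical helper)."""
    return quarters.str[:4].astype(int) * 4 + quarters.str[-1].astype(int)


alldata["diff_from_first"] = (
    quarter_index(alldata["Year_Quarter"]) - quarter_index(alldata["First_Sale"])
)

print(alldata)
```

A left merge keeps one row per booking, which avoids the extra rows that how='outer' can introduce when a key is missing on one side.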
