I have done some research on this, but couldn't find a concise method when the index is of type 'string'.
Given the following Pandas dataframe:
Platform | Action | RPG | Fighting
----------------------------------------
PC | 4 | 6 | 9
Playstat | 6 | 7 | 5
Xbox | 9 | 4 | 6
Wii | 8 | 8 | 7
I was trying to get the index (Platform) of the smallest value in the 'RPG' column, which would return 'Xbox'. I managed to make it work but it's not efficient, and looking for a better/quicker/condensed approach. Here is what I got:
# Return the minimum value of a series of all columns values for RPG
series1 = min(ign_data.loc['RPG'])
# Find the lowest value in the series
minim = min(ign_data.loc['RPG'])
# Get the index of that value using boolean indexing
result = series1[series1 == minim].index
# Format that index to a list, and return the first (and only) element
str_result = result.format()[0]
Use Series.idxmin:
df.set_index('Platform')['RPG'].idxmin()
#'Xbox'
or what #Quang Hoang suggests on the comments
df.loc[df['RPG'].idxmin(), 'Platform']
if Platform already the index:
df['RPG'].idxmin()
EDIT
df.set_index('Platform').loc['Playstat'].idxmin()
#'Fighting'
df.set_index('Platform').idxmin(axis=1)['Playstat']
#'Fighting'
if already the index:
df.loc['Playstat'].idxmin()
Related
I have a dataframe which contains Date, Visitor_ID and Pages columns. In the Page_visited column there are different row wise entries for each dates. Please refer the below table to understand the data.
[| Dates | Visitor_ID| Pages |
|:------ |:---------:| -----: |
| 10/1/2021 | 1 | xy |
| 10/1/2021 | 1 | step2 |
|10/1/2021 | 1 | xx |
|10/1/2021 | 1 | NetBanking|
| 10/1/2021 | 2 | step1 |
| 10/1/2021 | 2 | xy |
|10/1/2021 | 3 | step1 |
|10/1/2021 | 3 | NetBanking|
|11/1/2021 | 4 | step1 |
|12/1/2021 | 4 | NetBanking|][1]
Desired output:
Date Visitor_ID
|10/1/2021 | 1 |
|10/1/2021 | 3 |
the output should be a subset of actual data where the condition is that if for same Visitor_ID the page contains string "step" before string "Netbanking in same date then return the Visitor ID.
To initialise your dataframe you could do:
import pandas as pd
columns = ["Dates", "Visitor_ID", "Pages"]
records = [
["10/1/2021", 1, "xy"],
["10/1/2021", 1, "step2"],
["10/1/2021", 1, "NetBanking"],
["10/1/2021", 2, "step1"],
["10/1/2021", 2, "xy"],
["10/1/2021", 3, "step1"],
["10/1/2021", 3, "NetBanking"],
["11/1/2021", 4, "step1"],
["12/1/2021", 4, "NetBanking"]]
data = pd.DataFrame().from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out your third line in the records, otherwise I cannot reproduce your desired output. I have made this a multi-index data frame in order to easily loop over the groups 'date/visitor'. The structure of the dataframe looks like:
print(data)
Pages
Dates Visitor_ID
2021-10-01 1 xy
1 step2
1 NetBanking
2 step1
2 xy
3 step1
3 NetBanking
2021-11-01 4 step1
2021-12-01 4 NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
for date_time, data_per_date in data.groupby(level=0):
for visitor, data_per_visitor in data_per_date.groupby(level=0):
# select the column with the Pages
pages = data_per_visitor["Pages"].str
# make 2 boolean masks, for the records with step and netbanking
has_step = pages.contains("step")
has_netbanking = pages.contains("NetBanking")
# to get the records after each 'step' records, apply a diff on 'has_step'
# Convert to int first for the correct result
# each diff with outcome -1 fulfills this requirement. Make a
# mask based on this requirement
diff_step = has_step.astype(int).diff()
records_after_step = diff_step == -1
# combine the 2 mask to create your final mask to make a selection
mask = records_after_step & has_netbanking
# select the records and print to screen
selection = data_per_visitor[mask]
if not selection.empty:
print(selection.reset_index()[index_names])
This gives the following output:
Dates Visitor_ID
0 2021-10-01 1
1 2021-10-01 3
EDIT:
I was reading your question again. The solution above assumed that only records with 'NetBanking' directly following a record with 'step' is valid. That is why I thought your example input was not corresponding with your desired output. However, in case you are allowing rows in between an occurrence with 'step' and the first 'netbanking', the solution does not work. In that case, it is better to explicitly iterate of the rows of your dataframe per date and client id. An example then would be:
for date_time, data_per_date in data.groupby(level=0):
for visitor, data_per_visitor in data_per_date.groupby(level=0):
after_step = False
index_selection = list()
data_per_visitor.reset_index(inplace=True)
for index, records in data_per_visitor.iterrows():
page = records["Pages"]
if "step" in page and not after_step:
after_step = True
if "NetBanking" in page and after_step:
index_selection.append(index)
after_step = False
selection = data_per_visitor.reindex(index_selection)
if not selection.empty:
print(selection.reset_index()[index_names]
Normally I would not recommend to use 'iterrows' as it is really slow, but in this case I don't see an easy other solution. The output of the second algorithm is the same as the first for my data. In case you do include the third line from your example data, the second algorithm still gives the same output.
Im trying to create a column where i sum the previous x rows of a column by a parm given in a different column row.
I have a solution but its really slow so i was wondering if anyone could help do this alot faster.
| time | price |parm |
|--------------------------|------------|-----|
|2020-11-04 00:00:00+00:00 | 1.17600 | 1 |
|2020-11-04 00:01:00+00:00 | 1.17503 | 2 |
|2020-11-04 00:02:00+00:00 | 1.17341 | 3 |
|2020-11-04 00:03:00+00:00 | 1.17352 | 2 |
|2020-11-04 00:04:00+00:00 | 1.17422 | 3 |
and the slow slow code
#jit
def rolling_sum(x,w):
return np.convolve(x,np.ones(w,dtype=int),'valid')
#jit
def rol(x,y):
for i in range(len(x)):
res[i] = rolling_sum(x, y[i])[0]
return res
dfa = df[:500000]
res = np.empty(len(dfa))
r = rol(dfa.l_x.values, abs(dfa.mb).values+1)
r
Maybe something like this could work. I have made up an example with to_be_summed being the column of the value that should be summed up and looback holding the number of rows to be looked back
df = pd.DataFrame({"to_be_summed": range(10), "lookback":[0,1,2,3,2,1,4,2,1,2]})
summed = df.to_be_summed.cumsum()
result = [summed[i] - summed[max(0,i - lookback - 1)] for i, lookback in enumerate(df.lookback)]
What I did here is to first do a cumsum over the column that should be summed up. Now, for the i-th entry I can take the entry of this cumsum, and subtract the one i + 1 steps back. Note that this include the i-th value in the sum. If you don't want to inlcude it, you just have to change from summed[i] to summed[i - 1]. Also note that this part max(0,i - lookback - 1) will prevent you from accidentally looking back too many rows.
I know there are some questions on Stack Overflow on Sumproduct but the solution are not working for me. I am also new to Python Pandas.
For each row, I want to do a sumproduct of certain columns only if column['2020'] !=0.
I used the below code, but get error:
IndexError: ('index 2018 is out of bounds for axis 0 with size 27', 'occurred at index 0')
Pls help. Thank you
# df_copy is my dataframe
column_list=[2018,2019]
weights=[6,9]
def test(df_copy):
if df_copy[2020]!=0:
W_Avg=sum(df_copy[column_list]*weights)
else:
W_Avg=0
return W_Avg
df_copy['sumpr']=df_copy.apply(test, axis=1)
df_copy
**|2020 | 2018 | 2019 | sumpr|**
|0 | 100 | 20 | 0 |
|1 | 30 | 10 | 270 |
|3 | 10 | 10 | 150 |
I am sorry if the table doesn't look like a table. I can't create a table properly in Stackoverflow.
Basically for a particular row, if
2020 = 2 ,
2018 =30 ,
2019 =10 ,
sumpr= 30 * 9 + 10*9 = 270
Your column names are most likely strings, not integers.
To confirm it, run df_copy.columns and you should receive something like:
Index(['2020', '2018', '2019'], dtype='object')
(note apostrophes surrounding column names).
So change your column list to:
column_list = ['2018', '2019']
In your function change also the column name to a string:
df_copy['2020']
Then your code should run.
You can also run a more concise code:
df_copy['sumpr'] = np.where(df_copy['2020'] != 0, (df_copy[column_list]
* weights).sum(axis=1), 0)
I want to add values in a dataframe. But i want to write clean code (short and faster). I really want to improve my skill in writing.
Suppose that we have a DataFrame and 3 values
df=pd.DataFrame({"Name":[],"ID":[],"LastName":[]})
value1="ema"
value2=023123
value3="Perez"
I can write:
df.append([value1,value2,value3])
but the output is gonna create a new column
like
0 | Name | ID | LastName
ema | nan | nan | nan
023123 | nan | nan| nan
Perez | nan | nan | nan
i want the next output with the best clean code
Name | ID | LastName
ema | 023123 | Perez
There are a way to do this , without append one by one? (i want the best short\fast code)
You can convert the values to dict then use append
df.append(dict(zip(['Name', 'ID', 'LastName'],[value1,value2,value3])), ignore_index=True)
Name ID LastName
0 ema 23123.0 Perez
Here the explanation:
First set your 3 values into an array
values=[value1,value2,value3]
and make variable as index marker when lopping latter
i = 0
Then use the code below
for column in df.columns:
df.loc[0,column] = values[i]
i+=1
column in df.columns will give you all the name of the column in the DataFrame
and df.loc[0,column] = values[i] will set the values at index i to row=0 and column=column
[Here the code and the result]
I have two dataframes, DF1 and DF2, DF1 is the master which stores any additional information from DF2.
Lets say the DF1 is of the following format,
Item Id | item | count
---------------------------
1 | item 1 | 2
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 5
DF2 contains the 2 items which were already present in DF1 and two new entries. (itemId and item are considered as a single group, can be treated as the key for join)
Item Id | item | count
---------------------------
1 | item 1 | 2
3 | item 4 | 2
4 | item 4 | 4
5 | item 5 | 2
I need to combine the two dataframes such that the existing items count are incremented and new items are inserted.
The result should be like:
Item Id | item | count
---------------------------
1 | item 1 | 4
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 7
4 | item 4 | 4
5 | item 5 | 2
I have one way do achieve this not sure if its efficient or the right way to do
temp1 = df1.join(temp,['item_id','item'],'full_outer') \
.na.fill(0)
temp1\
.groupby("item_id", "item")\
.agg(F.sum(temp1["count"] + temp1["newcount"]))\
.show()
Since, the schema for the two dataframes is the same you can perform a union and then do a groupby id and aggregate the counts.
step1: df3 = df1.union(df2);
step2: df3.groupBy("Item Id", "item").agg(sum("count").as("count"));
There are several ways how to do it.
Based on what you describe the most straightforward solution would be to use RDD - SparkContext.union:
rdd1 = sc.parallelize(DF1)
rdd2 = sc.parallelize(DF2)
union_rdd = sc.union([rdd1, rdd2])
the alternative solution would be to use DataFrame.union from pyspark.sql
Note: I have suggested unionAll previously but it is deprecated in Spark 2.0
#wandermonk's solution is recommended as it does not use join. Avoid joins as much as possible as this triggers shuffling (also known as wide transformation and leads to data transfer over the network and that is expensive and slow)
You also have to look into your data size (both tables are big or one small one big etc) and accordingly you can tune the performance side of it.
I tried showing the group by a solution using SparkSQL as they do the same thing but easier to understand and manipulate.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
list_1 = [[1,"item 1" , 2],[2 ,"item 2", 3],[1 ,"item 3" ,2],[3 ,"item 4" , 5]]
list_2 = [[1,"item 1",2],[3 ,"item 4",2],[4 ,"item 4",4],[5 ,"item 5",2]]
my_schema = StructType([StructField("Item_ID",IntegerType(), True),StructField("Item_Name",StringType(), True ),StructField("Quantity",IntegerType(), True)])
df1 = spark.createDataFrame(list_1, my_schema)
df2 = spark.createDataFrame(list_2, my_schema)
df1.createOrReplaceTempView("df1")
df1.createOrReplaceTempView("df2")
df3 = df2.union(df1)
df3.createOrReplaceTempView("df3")
df4 = spark.sql("select Item_ID, Item_Name, sum(Quantity) as Quantity from df3 group by Item_ID, Item_Name")
df4.show(10)
now if you look into the SparkUI, you can see for such a small data set, the shuffle operation, and # of stages.
Number of stages for such a small job
Number the shuffle operation for this group by command
I also recommend to see the SQL plan and understand the cost. Exchange represents the shuffle here.
== Physical Plan ==
*(2) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, Quantity#32L])
+- Exchange hashpartitioning(Item_ID#6, Item_Name#7, 200)
+- *(1) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[partial_sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, sum#38L])
+- Union
:- Scan ExistingRDD[Item_ID#6,Item_Name#7,Quantity#8]
+- Scan ExistingRDD[Item_ID#0,Item_Name#1,Quantity#2]