pandas crosstab for two columns - python

I am trying to make a contingency table using pd.crosstab from my local dataframe. Imagine we asked 3 people in 2 separate groups the question of whether they like ice cream or not, and here is the result in a dataframe:
group1 | group2
------------------
yes | no
no | maybe
yes | no
And I would like the contingency table to look like this:
| group1 | group2
----------------------------
yes | 2 | 0
no | 1 | 2
maybe | 0 | 1
I have played around with pandas and consulted many different resources, including the docs and other posts, but couldn't figure this out. Does anyone have any ideas? Thanks!

Pandas has a crosstab function that solves this; first you have to melt the dataframe:
box = df.melt()
pd.crosstab(box.value, box.variable)
variable group1 group2
value
maybe 0 1
no 1 2
yes 2 0
For performance, it is possible that groupby will be faster, even if it involves a few more steps:
box.groupby(["variable", "value"]).size().unstack("variable", fill_value=0)
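As a self-contained check, here is a minimal sketch that rebuilds the question's dataframe and runs both approaches. The names by_crosstab, by_groupby and ordered are only illustrative, and the final reindex is an optional extra to get the yes / no / maybe row order from the question.

```python
import pandas as pd

# Data from the question: three answers per group
df = pd.DataFrame({
    "group1": ["yes", "no", "yes"],
    "group2": ["no", "maybe", "no"],
})

# Long format: one row per (variable, value) pair
box = df.melt()

# Count how often each answer occurs in each group
by_crosstab = pd.crosstab(box["value"], box["variable"])

# Same counts via groupby; fill_value=0 covers combinations that never occur
by_groupby = box.groupby(["variable", "value"]).size().unstack("variable", fill_value=0)

# Optional: put the rows in the order the question asked for
ordered = by_crosstab.reindex(["yes", "no", "maybe"])
print(ordered)
```

Both tables hold the same counts; crosstab is shorter to write, while the groupby form makes each step explicit.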

Related

How to combine two pandas dataset based on multiple conditions?

I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows.
The first one contains almost 300k entries, while the second one contains almost 1000 entries.
More specifically, the first dataset, "A", has the following columns:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress
While the second dataset: "B" contains the following information:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] AND B['Class'] is in A['Path']
Then
Merge the two lines into another data frame "C"
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
I'm not sure if this is the best or the most efficient way, but I have tested it and it worked. My answer is pretty straightforward: we loop over the two dataframes and apply the desired conditions.
Suppose dataset A is df_a and dataset B is df_b.
First we have to add a suffix to every column of df_a and df_b so that the rows can be combined later.
df_a.columns= [i+'_A' for i in df_a.columns]
df_b.columns= [i+'_B' for i in df_b.columns]
And then we can apply this for loop:
rows = []
# Iterate through df_a
for idx_A, v_A in df_a.iterrows():
    # Iterate through df_b
    for idx_B, v_B in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A'] == v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Cast both series to dictionaries, merge them, and collect the result
            rows.append({**v_A.to_dict(), **v_B.to_dict()})
# Build df_c once at the end (DataFrame.append was removed in pandas 2.0)
df_c = pd.DataFrame(rows)
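With about 300k rows in A and 1000 in B, the nested loop performs on the order of 300 million comparisons. A possible alternative, sketched below, lets merge handle the equality condition and applies the substring condition afterwards. The sample values are made up for illustration, and the suffixes only affect the one column name the two frames share (Path).

```python
import pandas as pd

# Tiny stand-ins for datasets A and B (values are made up for illustration)
df_a = pd.DataFrame({
    "Path": ["src.com.bla.MainActivity.java", "src.other.Thing.java"],
    "Line": [24, 7],
    "Name": ["app_15", "app_15"],
})
df_b = pd.DataFrame({
    "Class": ["com.bla.MainActivity"],
    "Path": ["app_15"],
    "DTWC": [0],
})

# Equality condition: A['Name'] == B['Path'], done as an inner join
merged = df_a.merge(df_b, left_on="Name", right_on="Path", suffixes=("_A", "_B"))

# Substring condition: B['Class'] must occur inside A['Path']
mask = [cls in path for cls, path in zip(merged["Class"], merged["Path_A"])]
df_c = merged[mask].reset_index(drop=True)
print(df_c)
```

The join only pairs rows whose Name and Path already match, so the substring test runs on far fewer candidates than the full cross product.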

Python, Pandas: Access multiple columns in lambda-function

Good day,
I would like to ask if it's possible to access more than one column in a lambda-function inside a pandas-dataframe or if there's an alternative!?
For example my dataframe is looking something like this:
value_a | value_b | value_c
1 | 17 | 8
2 | 253 | 9
3 | 89 | 8
...
I also got a function that is calculating with some of the data:
def some_function(a, b):
    # ...do something
    return c
Now I want to use lambda-function to calculate together with the function but include the data from two columns. Something like this...
df['value_d'] = df['value_b'].apply(lambda x: some_function(x, df['value_c']))
Is it possible to access more than one column inside such a function or is there a better solution?
Hoping my question is understandable.
Thanks to all of you and have a great day!
Use apply over the whole DataFrame with axis=1, so the lambda receives one row at a time:
df['value_d'] = df.apply(lambda row: some_function(row['value_b'],row['value_c']), axis=1)
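A runnable sketch of this, using the sample values from the question. The body of some_function is a hypothetical placeholder (a simple sum), since the question does not say what it computes. The second variant, a list comprehension over zip, is an alternative that avoids building a Series for every row; the column name value_e is only for comparison.

```python
import pandas as pd

def some_function(a, b):
    # Hypothetical placeholder; the question does not say what it computes
    return a + b

df = pd.DataFrame({
    "value_a": [1, 2, 3],
    "value_b": [17, 253, 89],
    "value_c": [8, 9, 8],
})

# Row-wise apply: the lambda receives each row as a Series
df["value_d"] = df.apply(lambda row: some_function(row["value_b"], row["value_c"]), axis=1)

# Same result with zip over the two columns
df["value_e"] = [some_function(b, c) for b, c in zip(df["value_b"], df["value_c"])]
print(df)
```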

Python - Groupby a DataFrameGroupBy object

I have a pandas dataframe in Python to which I am applying a groupby, and then I want to apply a new groupby + sum to the previous result. To be more specific, first I am doing:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
And then I want to do:
check_df = check_df.groupby(['market'])['number_of_rooms'].sum()
So, I am getting the following error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method
My initial data look like that:
hotel_code | market | number_of_rooms | ....
---------------------------------------------
001 | a | 200 | ...
001 | a | 200 |
002 | a | 300 | ...
Notice that I may have duplicates of pairs like (a - 200); that's why I need the first groupby.
What I want in the end is something like that:
Market | Rooms
--------------
a | 3000
b | 250
I'm just trying to translate the following sql query into python:
select a.market, sum(a.number_of_rooms)
from (
select market, number_of_rooms
from opinmind_dev..cg_mm_booking_dataset_full
group by hotel_code, market, number_of_rooms
) as a
group by market ;
Any ideas how I can fix that? If you need any more info, let me know.
ps. I am new to Python and data science
IIUC, instead of:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
You should simply do:
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
    .loc[:, ['market', 'number_of_rooms']]\
    .groupby('market')\
    .sum()
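To see the effect of dropping duplicates first, here is a small sketch on made-up rows shaped like the question's data (the dp_id values and the market b row are assumptions). The first two rows are exact duplicates on the key columns, so only one of them should count toward the total.

```python
import pandas as pd

# Made-up rows in the shape of the question's data; the first two are duplicates
data_df = pd.DataFrame({
    "hotel_code": ["001", "001", "002", "003"],
    "dp_id": [1, 1, 1, 2],
    "market": ["a", "a", "a", "b"],
    "number_of_rooms": [200, 200, 300, 250],
})

keys = ["hotel_code", "dp_id", "market", "number_of_rooms"]

# Equivalent of the inner SQL "group by": keep one row per key combination,
# then sum the rooms per market
check_df = (
    data_df.drop_duplicates(subset=keys)
    .groupby("market", as_index=False)["number_of_rooms"]
    .sum()
)
print(check_df)
```

Without drop_duplicates, market a would sum to 700 instead of 500, because the duplicated row would be counted twice.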
df = pd.DataFrame({'Market': [1,1,1,2,2,2,3,3], 'Rooms':range(8), 'C':np.random.rand(8)})
Market Rooms C
0 1 0 0.187793
1 1 1 0.325284
2 1 2 0.095147
3 2 3 0.296781
4 2 4 0.022262
5 2 5 0.201078
6 3 6 0.160082
7 3 7 0.683151
You need to move the column selection away from the grouped DataFrame. Either of the following should work.
df.groupby('Market').sum()[['Rooms']]
df[['Rooms']].groupby(df['Market']).sum()
Rooms
Market
1 3
2 12
3 13
If you select using ['Rooms'] instead of [['Rooms']] you will get a Series instead of a DataFrame.
The dataframes produced use market as their index. If you want to convert it to a normal data column, use:
df.reset_index()
Market Rooms
0 1 3
1 2 12
2 3 13
If I understand your question correctly, you could simply do one of the following:
data_df.groupby('Market').agg({'Rooms': np.sum})
data_df.groupby(['Market'], as_index=False).agg({'Rooms': np.sum})
For example:
data_df = pd.DataFrame({'Market': ['A', 'B', 'C', 'B'],
                        'Hotel': ['H1', 'H2', 'H4', 'H5'],
                        'Rooms': [20, 40, 50, 34]})
data_df.groupby('Market').agg({'Rooms': np.sum})

Merge two dataframes in PySpark

I have two dataframes, DF1 and DF2, DF1 is the master which stores any additional information from DF2.
Let's say DF1 has the following format:
Item Id | item | count
---------------------------
1 | item 1 | 2
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 5
DF2 contains the 2 items which were already present in DF1 and two new entries. (itemId and item are considered as a single group, can be treated as the key for join)
Item Id | item | count
---------------------------
1 | item 1 | 2
3 | item 4 | 2
4 | item 4 | 4
5 | item 5 | 2
I need to combine the two dataframes such that the existing items count are incremented and new items are inserted.
The result should be like:
Item Id | item | count
---------------------------
1 | item 1 | 4
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 7
4 | item 4 | 4
5 | item 5 | 2
I have one way to achieve this, but I am not sure if it is efficient or the right way to do it:
temp1 = df1.join(temp,['item_id','item'],'full_outer') \
    .na.fill(0)
temp1\
    .groupby("item_id", "item")\
    .agg(F.sum(temp1["count"] + temp1["newcount"]))\
    .show()
Since the schema of the two dataframes is the same, you can perform a union and then group by the key columns and aggregate the counts (F is pyspark.sql.functions, as in the question):
Step 1: df3 = df1.union(df2)
Step 2: df3.groupBy("Item Id", "item").agg(F.sum("count").alias("count"))
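To check what the union-then-aggregate approach should return without starting a Spark session, the same logic can be sketched in pandas on the question's sample data. This is only an in-memory analogue (concat in place of union); the column names item_id, item and count follow the question's own code.

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({
    "item_id": [1, 2, 1, 3],
    "item": ["item 1", "item 2", "item 3", "item 4"],
    "count": [2, 3, 2, 5],
})
df2 = pd.DataFrame({
    "item_id": [1, 3, 4, 5],
    "item": ["item 1", "item 4", "item 4", "item 5"],
    "count": [2, 2, 4, 2],
})

# Union: stack the rows of both frames
combined = pd.concat([df1, df2], ignore_index=True)

# Aggregate: sum the counts per (item_id, item) key
result = combined.groupby(["item_id", "item"], as_index=False)["count"].sum()
print(result)
```

Existing keys such as (1, item 1) and (3, item 4) have their counts added together, while keys that appear in only one frame pass through unchanged.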
There are several ways to do it.
Based on what you describe, the most straightforward solution would be to use RDDs with SparkContext.union:
rdd1 = DF1.rdd
rdd2 = DF2.rdd
union_rdd = sc.union([rdd1, rdd2])
The alternative solution would be to use DataFrame.union from pyspark.sql.
Note: I suggested unionAll previously, but it is deprecated in Spark 2.0.
@wandermonk's solution is recommended as it does not use a join. Avoid joins as much as possible, since they trigger shuffling (a wide transformation that transfers data over the network, which is expensive and slow).
You also have to look at your data sizes (both tables big, or one small and one big, etc.) and tune the performance accordingly.
I have shown the group-by solution using Spark SQL below; it does the same thing but is easier to understand and manipulate.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
list_1 = [[1,"item 1" , 2],[2 ,"item 2", 3],[1 ,"item 3" ,2],[3 ,"item 4" , 5]]
list_2 = [[1,"item 1",2],[3 ,"item 4",2],[4 ,"item 4",4],[5 ,"item 5",2]]
my_schema = StructType([StructField("Item_ID",IntegerType(), True),StructField("Item_Name",StringType(), True ),StructField("Quantity",IntegerType(), True)])
df1 = spark.createDataFrame(list_1, my_schema)
df2 = spark.createDataFrame(list_2, my_schema)
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
df3 = df2.union(df1)
df3.createOrReplaceTempView("df3")
df4 = spark.sql("select Item_ID, Item_Name, sum(Quantity) as Quantity from df3 group by Item_ID, Item_Name")
df4.show(10)
Now if you look into the Spark UI, you can see the number of stages and the shuffle operation, even for such a small data set.
[Screenshot: number of stages for such a small job]
[Screenshot: the shuffle operation for this group-by command]
I also recommend looking at the SQL plan to understand the cost. Exchange represents the shuffle here.
== Physical Plan ==
*(2) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, Quantity#32L])
+- Exchange hashpartitioning(Item_ID#6, Item_Name#7, 200)
   +- *(1) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[partial_sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, sum#38L])
      +- Union
         :- Scan ExistingRDD[Item_ID#6,Item_Name#7,Quantity#8]
         +- Scan ExistingRDD[Item_ID#0,Item_Name#1,Quantity#2]

Extract first "set of rows" matching a particular condition in Spark Dataframe (Pyspark)

I have a Spark DataFrame with data like below:
ID | UseCase
-----------------
0 | Unidentified
1 | Unidentified
2 | Unidentified
3 | Unidentified
4 | UseCase1
5 | UseCase1
6 | Unidentified
7 | Unidentified
8 | UseCase2
9 | UseCase2
10 | UseCase2
11 | Unidentified
12 | Unidentified
I have to extract the top 4 rows which have value Unidentified in column UseCase and do further processing with them. I don't want to get the middle and last two rows with Unidentified value at this point.
I want to avoid using the ID column as they are not fixed. The above data is just a sample.
When I use map function (after converting this to RDD) or UDFs, I end up with 8 rows in my output DataFrame (which is expected of these functions).
How can this be achieved? I am working in PySpark. I don't want to use collect on the DataFrame and get it as a list to iterate over. This would defeat the purpose of Spark. The DataFrame size can go up to 4-5 GB.
Could you please suggest how this can be done?
Thanks in advance!
Just do a filter and a limit. The following code is Scala, but you'll understand the point.
Assume your dataframe is called df, then:
df.filter($"UseCase"==="Unidentified").limit(4).collect()
