Apply function to 2nd column in pandas dataframe groupby - python

In a pandas dataframe, a function can be passed to groupby to group by the index. I'm looking to define a function that is instead applied to a column.
I'm looking to group by two columns, except I need the second column to be grouped by an arbitrary function, foo:
group_sum = df.groupby(['name', foo])['tickets'].sum()
How would foo be defined to group the second column into two groups, demarcated by whether values are > 0, for example? Or, is an entirely different approach or syntax used?

Groupby can accept any combination of labels and series/arrays (as long as the array has the same length as your dataframe), so you can map the function onto your column and pass the result into the groupby, like:
df.groupby(['name', df[1].map(foo)])
Alternatively, you might want to add the condition as a new column to your dataframe before you perform the groupby; this has the advantage of giving it a name in the index:
df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
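For instance, here is a minimal runnable sketch of the map approach; the column names name, value and tickets and the sample data are assumptions for illustration, with foo bucketing the second column by whether its values are > 0:
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration.
df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b'],
    'value': [-1, 2, 3, 0],
    'tickets': [10, 20, 30, 40],
})

def foo(v):
    # Bucket values into two groups: greater than zero or not.
    return v > 0

# Map foo over the 'value' column and pass the resulting boolean Series to groupby.
group_sum = df.groupby(['name', df['value'].map(foo)])['tickets'].sum()
print(group_sum)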

Something like this will work:
x.groupby(['name', x['value']>0])['tickets'].sum()
As mentioned above, groupby can accept labels and series. This should give you the answer you are looking for. Here is an example:
data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns = ['name', 'value', 'value2'])
x.groupby(['name', x['value']>0])['value2'].sum()
name  value
1     False     20
      True     100
2     False    100
Name: value2, dtype: int64

Related

pandas value_counts(): directly compare two instances

I have done .value_counts() on two DataFrames (on a similar column) and would like to compare the two.
I also tried converting the resulting Series to DataFrames (.to_frame('counts'), as suggested in this thread), but it doesn't help.
first = df1['company'].value_counts()
second = df2['company'].value_counts()
I tried to merge, but I think the main problem is that I don't have the company name as a column; it's the index (?). Is there a way to resolve this, or a different way to get the comparison?
GOAL: The end goal is to be able to see which companies occur more in df2 than in df1, and the value_counts() themselves (or the difference between them).
You might use collections.Counter's ability to subtract, as follows:
import collections
import pandas as pd
df1 = pd.DataFrame({'company':['A','A','A','B','B','C','Y']})
df2 = pd.DataFrame({'company':['A','B','B','C','C','C','Z']})
c1 = collections.Counter(df1['company'])
c2 = collections.Counter(df2['company'])
c1.subtract(c2)
print(c1)
which gives the output:
Counter({'A': 2, 'Y': 1, 'B': 0, 'Z': -1, 'C': -2})
Explanation: a positive value means there are more instances in df1, zero means the counts are equal, and a negative value means there are more instances in df2.
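If you only want the companies that occur more often in df2 (the stated goal), here is a small follow-up sketch on the Counter above:
# Negative counts mean the company occurs more often in df2 than in df1;
# flip the sign so the value reads "how many more times it appears in df2".
more_in_df2 = {name: -count for name, count in c1.items() if count < 0}
print(more_in_df2)  # {'C': 2, 'Z': 1}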
Use this code:
df2['x'] = '2'
df1['x'] = '1'
df = pd.concat([df1[['company', 'x']], df2[['company', 'x']]])
# Count occurrences of each company per source ('1' = df1, '2' = df2).
df = pd.pivot_table(df, index='company', columns='x', aggfunc='size', fill_value=0).reset_index()
Now filter df for the rows you need.
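For instance (a sketch, assuming the pivot above), the two count columns can then be compared directly:
# Columns '1' and '2' hold the per-source counts after the pivot.
df['diff'] = df['2'] - df['1']
print(df[df['diff'] > 0])  # companies that occur more often in df2 than in df1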

Select values from different Dataframe column based on a list of index

Here is my issue: I have a dataframe, let's say:
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
I also have a list of indices:
idx = [1, 2]
I would like to store the corresponding value from each column in a list.
Meaning I want the value at the first index in the first column and the value at the second index in the second column.
I'm sure there is a simple answer to my issue, however I'm mixing everything up with iloc and cannot find a way of developing an optimized method for my case (I have 1000 rows and 4 columns).
IIUC, you can try:
You can extract the complete rows and then pick the diagonal elements:
result = np.diag(df.values[idx])
Alternative:
Convert the dataframe to a NumPy array.
Use NumPy indexing to access the required values.
result = df.values[idx, range(len(df.columns))]
OUTPUT:
array([6, 3])
Use:
list(df.values[idx, range(len(idx))])
Output:
[6, 3]
Here is a different way:
df.stack().loc[list(zip(idx,df.columns[:(len(idx))]))].to_numpy()
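For the actual case in the question (1000 rows and 4 columns), the same NumPy indexing scales directly; here is a sketch with made-up data:
import numpy as np
import pandas as pd

# Made-up data matching the shape described in the question: 1000 rows, 4 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))
idx = [3, 7, 42, 999]  # hypothetical: one row index per column

result = df.to_numpy()[idx, np.arange(len(idx))]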

Python: Extract unique index values and use them in a loop

I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index, and as you can see below, it is stored in a FrozenList. I am just interested in the index values that contain a long number. The word "vehicle", also an index level, can be removed as it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need:
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A' : [5]*5, 'B' : [6]*5})
df = df.set_index('A',append=True)
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
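Going back to the question's own loop, a minimal sketch (the SERIAL_NUMBER column and the multi-index layout are taken from the question):
# Iterate over the unique values of the first index level (the long ids)
# and inspect the unique SERIAL_NUMBER values for each of them.
for idx_value in df.index.get_level_values(0).unique():
    serials = df.loc[idx_value]['SERIAL_NUMBER'].unique()
    print(idx_value, serials)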

Pyspark: function joining changeable number of columns

I wonder if there is a way to automate that...
I want to make a function to which I tell how many columns I want to join. If I have a DataFrame with 3 columns and pass the parameter number_of_columns=3, then it will join columns 0, 1, 2. But if I have a DataFrame with 7 columns and pass number_of_columns=7, then it will join columns 0, 1, 2, 3, 4, 5, 6.
The names of the columns are always the same: from "0" to "number_of_columns - 1".
Is there any way to do that? Or must I have a separate function for each number of columns to merge?
def my_function(spark_column, name_of_column):
    new_spark_column = spark_column.withColumn(name_of_column, concat_ws("",
        col("0").cast("Integer"),
        col("1").cast("Integer"),
        col("2").cast("Integer"),
        col("3").cast("Integer"),
        col("4").cast("Integer"),
        col("5").cast("Integer"),
        col("6").cast("Integer")))
You can use a list comprehension to do this:
from pyspark.sql.functions import concat_ws, col
def my_function(spark_column, n_cols, name_of_column):
    new_spark_column = spark_column.withColumn(
        name_of_column,
        concat_ws("", *[col(c).cast("Integer") for c in spark_column.columns[:n_cols]])
    )
    return new_spark_column
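A hypothetical usage sketch (the SparkSession and the toy data are assumptions, not part of the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with columns named "0", "1", "2", as described in the question.
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["0", "1", "2"])

# Join the first three columns into a single string column "joined".
my_function(df, n_cols=3, name_of_column="joined").show()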

Python: for cycle in range over files

I am trying to create a list that takes values from different files.
I have three dataframes called, for example, "df1", "df2", "df3".
Each file contains two columns of data, so for example "df1" looks like this:
0, 1
1, 4
7, 7
I want to create a list that takes the value from the first row of the second column in each file, so it should look like this:
F = [1, value from df2, value from df3]
My try:
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
F=[]
for i in range(3):
    F.append(df{"i"}[1][0])
Probably that is not how to iterate, but I cannot figure out the correct way.
You can use iloc and a list comprehension:
vals = [df.iloc[0, 1] for df in [df1,df2,df3]]
iloc will get the value from the first row (index 0) and second column (index 1). If you wanted, say, the value from the third row and fourth column, you'd do .iloc[2, 3], and so forth.
As suggested by #jpp, you may use iat instead:
vals = [df.iat[0, 1] for df in [df1,df2,df3]]
For the difference between them, check this and this question.
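If the goal is also to avoid hard-coding df1, df2 and df3 by name, here is a sketch that loops over the file paths instead (file1, file2, file3 are the question's own placeholders):
import pandas as pd

# Read each file and take the value in the first row, second column.
F = [pd.read_csv(f).iloc[0, 1] for f in [file1, file2, file3]]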
