I have a Pandas DataFrame of the form
df = pd.DataFrame({'1':['a','b','c'], '2':['b','a','d'], '3':['0.7','0.6','0.1']}).
I'd like to add a column to this DataFrame which contains the number of times a specific row is present, without considering the order (since the first two columns are the nodes of an undirected graph). Moreover, I'd like to merge those rows that differ only for the order of the first two columns, and take the mean of the numbers in the third one. In this case, it should be
df = pd.DataFrame({'1':['a','c'], '2':['b','d'], '3':['0.65','0.1'], '4':['2','1']}).
Consider also that the DataFrame contains more than 100.000 rows.
Use -
a=df[['1','2']].values
a.sort(axis=1)
df[['1','2']] = a
df.groupby(['1','2'])['3'].agg(['count','mean']).reset_index()
Output
1 2 count mean
0 a b 2 0.65
1 c d 1 0.10
or
df[['1','2']] = df[['1','2']].sort_values(1,axis=1)
df.groupby(['1','2'])['3'].agg(['count','mean']).reset_index()
Related
I have a data frame containing hyponym and hypernym pairs extracted from StackOverflow posts. You can see an excerpt from it in the following:
0 1 2 3 4
linq query asmx web service THH 10 a linq query as an asmx web service
application bolt THH 1 my application is a bolt on data visualization...
area r time THH 1 the area of the square is r times
sql query syntax HTH 3 sql like query syntax
...
7379596 rows × 5 columns
The column 0 and the column 1 contain the hyponym and hyperonym parts of the phrases contained by the column 4. I would like to implement a filter based on statistical features, therefore I have to count all occurrences of the pairs (0, 1) columns together, all occurrences of the hyponym and hyperonym parts respectively. Pandas has a method called value_counts(), so counting the occurrences can be obtained by:
df.value_counts([0])
df.value_counts([1])
df.value_counts([0, 1])
This is nice, but the method resulted in a Pandas Series which has much fewer records than the original DataFrame, therefore, adding a new column like df[5] = df.value_counts([0, 1]) does not work.
I have found a workaround: I have created 3 Pandas Series for every occurrence type (pair, hyponym, hyperonym) and I have written a small loop to calculate a confidence score for every pair but as the original dataset is huge (more than 7 million records) this calculation is not an efficient way to do that (the calculation has not finished after 30 hours). So, the feasible and hopefully efficient solution would be using the Pandas applymap() for this purpose, but it is needed to attach columns containing the occurrences to the original DataFrame. So I would like a DataFrame like this one:
0 1 2 3 4 5 6 7
sql query anything anything a phrase 1000 800 500
sql query anything anything anotherphrase 1000 800 500
...
The column 5 is the occurences of the hyponym part (sql), the column 6 is the number of occurrences of the hyperonym part (query) and the column 7 is the occurrences of the pair (sql,
query). As you can see the pairs are the same but they are extracted from different phrases.
My question is how to do that? How can I attach occurrences as a new column to an existing DataFrame?
Here's a solution on how to map the value counts of the combination of two columns to a new column:
# Create an example DataFrame
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})
# Count the paired occurrences in a new column
df["count"] = df.groupby([0,1])[0].transform('size')
Before editing, I had answered this question with a solution using value_counts and a merge. This original solution is slower and more complicated than the groupby:
# Put the value_counts in a new DataFrame, call them count
vcdf = pd.DataFrame(df[[0, 1]].value_counts(), columns=["count"])
# Merge the df with the vcs
merged = pd.merge(left=df, right=vcdf, left_on=[0, 1], right_index=True)
# Potentially sort index
merged = merged.sort_index()
The resulting DataFrame:
0 1 count
0 a c 1
1 a d 2
2 a d 2
3 b d 1
I have a filtered data frame that changes how many rows it has, depending on how the user filters it. I need to count how many times a value in one column matches with a value in another column for each row. for example: Lets say my DF is:
sample data frame
I would need a 3x3 matrix that has the sums of each time a value in column A occurs with a value in column B. The matrix would be:
sample matrix
I can determine what the dimension of the matrix should be, and what the unique values are for each column and how many times they occur, but need to find an efficient way to sum how frequently the values appear together.
I was able to achieve your desired output like this:
from collections import Counter
import pandas as pd
df = pd.DataFrame({'A':[500,500,300,400,400,300],'B':[10,10,20,10,20,30]})
inp = df.groupby('A')['B'].agg(Counter).to_frame()
out = pd.DataFrame.from_records(inp['B'].tolist(), index=inp.index).fillna(0).T
output:
A 300 400 500
20 1.0 1.0 0.0
30 1.0 0.0 0.0
10 0.0 1.0 2.0
the groupyby/Counter counts how many occurrences of values in B occur in each value in A..then the second step unpacks the column of dictionaries into a dataframe
I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot funciton?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
#Creates a template from the dataframe
def create_template(df):
act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
activities = df[act_cols]
flat_acts = activities.values.ravel('K')
unique_locations = pd.unique(df['Location'])
unique_acts = pd.unique(flat_acts)
pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
return pivot_template
#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
for row in df.itertuples():
for act, quantity in zip(act_cols, quant_cols):
act_val = getattr(row, act)
if pd.notna(act_val):
quantity_val = getattr(row, quantity)
location = getattr(row, 'Location')
pivot_frmt.loc[location, act_val] += quantity_val
return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simulatenously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed I will end up with multiple dataframes that all have the same row counts, but potentially different column counts. I solved this issue by simply concatenating the columns and if any columns are repeated, I then sum them to get the final result.
def test_pivot(df):
act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
dfs = []
for act, quant in zip(act_cols, quant_cols):
partial = pd.crosstab(index=df['Location'], columns=df[act], values=df[quant], aggfunc=np.sum).fillna(0)
dfs.append(partial)
finalDf = pd.concat(dfs, axis=1)
finalDf = test.groupby(finalDf.columns, axis=1).sum()
return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the time it took to process the data by a very significant margin (from 10s ~4k rows to 0.2s ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as reference to where the scan starts. I am trying to find the object placed, through out my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out tis :
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-index (I have now 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby
moy=base.sort_values('Mean').tail(1)
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter="\s+")
df = df.set_index("Index")
print(df)
Basically, how would I create a pivot table that consolidates data, where one of the columns of data it represents is calculated, say, by likelihood percentage (0.0 - 1.0) by taking the mean, and another is calculated by number ordered which sums all of them?
Right now I can specify values=... to indicate what should make up one of the two, but then when I specify the aggfunc=... I don't know how the two interoperate.
In my head I'd specify two values for values=... (likelihood percentage and number ordered) and two values for aggfunc=..., but this does not seem to be working.
You could supply to aggfunc a dictionary with column:funtion (key:value) pairs:
df = pd.DataFrame({'a':['a','a','a'],'m':[1,2,3],'s':[1,2,3]})
print df
a m s
0 a 1 1
1 a 2 2
2 a 3 3
df.pivot_table(index='a', values=['m','s'], aggfunc={'m':pd.Series.mean,'s':sum})
m s
a
a 2 6