Pandas groupby and then sort based on groups - python

I have a dataset of news articles and their associated concepts and sentiment (NLP-detected), which I want to group by two fields: the Concept and the Source. A simplified example follows:
>>> df = pandas.DataFrame({'concept_label': [1, 1, 2, 2, 3, 1, 1, 1],
...                        'source_uri': ['A', 'B', 'A', 'A', 'A', 'C', 'C', 'C'],
...                        'sentiment_article': [0.05, 0.15, -0.3, -0.2, -0.5, -0.6, -0.3, -0.4]})
concept_label source_uri sentiment_article
1 A 0.05
1 B 0.15
2 A -0.3
2 A -0.2
3 A -0.5
1 C -0.6
1 C -0.3
1 C -0.4
So I basically want to know, for a concept such as "Coronavirus", how often each news outlet writes about the topic and what the mean sentiment of those articles is. The above df would then look like this:
mean count
concept_label source_uri
3 A -0.50 1
2 A -0.25 2
1 A 0.050 1
1 B 0.150 1
1 C -0.43 3
I am able to do the grouping with the following code (df is the pandas dataframe I'm using, concept_label is the concept, and source_uri is the news outlet):
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count'])
This works just fine and gives me the values I need; however, I want the groups with the highest aggregate "count" to be at the top. The way I tried to do that is by changing it to the following:
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count']).sort_values(by=['count'], ascending=False)
However, even though this sorts by the count, it breaks up the groups again. My result currently looks like this:
mean count
concept_label source_uri
3 A -0.50 1
1 A 0.050 1
1 B 0.150 1
2 A -0.25 2
1 C -0.43 3

I don't believe this is the nicest answer, but I found a way to do it.
I first grouped the full dataframe and saved the total count per concept_label, which I then merged back into the existing dataframe. This way I can sort primarily on that column and secondarily on the actual count.
#adding count column to existing table
df_grouped = df.groupby(['concept_label'])['concept_label'].agg(['count']).sort_values(by=['count'])
df_grouped.rename(columns={'count':'concept_count'}, inplace=True)
df_count = pd.merge(df, df_grouped, left_on='concept_label', right_on='concept_label')
#sorting
df_sentiment = df_count.groupby(['concept_label','source_uri','concept_count'])['sentiment_article'].agg(['mean', 'count']).sort_values(by=['concept_count','count'], ascending=False)
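Not part of the original answer: a shorter sketch that avoids the merge by computing the per-concept total directly from the aggregated table (assuming the same df as above):
agg = df.groupby(['concept_label', 'source_uri'])['sentiment_article'].agg(['mean', 'count'])
# total article count per concept, aligned with the aggregated rows
concept_total = agg.groupby(level='concept_label')['count'].transform('sum')
agg_sorted = (agg.assign(concept_count=concept_total)
                 .sort_values(['concept_count', 'count'], ascending=False)
                 .drop(columns='concept_count'))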

Related

How to find how many decimal places there are in a column using a pandas dataframe

I have a large dataframe that contains a column whose values have different numbers of decimal places. I want to create something like the DecimalPlaces column in my example. The goal of this column is to count the number of decimal places in ColB.
df
ColA ColB DecimalPlaces
A .03 2
B .003 3
C 10.01 2
D 11.1 1
I tried the below, but I can't get it to work for a whole column of a dataframe:
import decimal

d = decimal.Decimal('56.43256436')
d.as_tuple().exponent
Here is one way to do it:
split the number at the decimal point, then take the length of the fractional part
df['decimals']=df['ColB'].astype('str').str.split('.', expand=True)[1].apply(lambda x: len(x))
df
ColA ColB DecimalPlaces decimals
0 A 0.030 2 2
1 B 0.003 3 3
2 C 10.010 2 2
3 D 11.100 1 1
I've got basically the same as Naveed here, but, well, slightly different:
df['decimals'] = df['ColB'].map(lambda x: str(x).split('.')[1]).apply(len)
No idea which is faster or more efficient.
You can do:
df['ColB'].astype(str).str.extract(r'\.(.*)', expand=False).str.len()
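The decimal.Decimal attempt from the question can also be applied column-wide; a sketch, assuming ColB holds plain floats (str() yields the shortest decimal representation, so binary float noise is not a problem here):
import decimal
# the exponent is negative for fractional digits, so negate it to get the count
df['decimals'] = df['ColB'].apply(lambda x: -decimal.Decimal(str(x)).as_tuple().exponent)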

How to keep the best row in a pandas dataframe satisfying multiple conditions with groupby

I have a pandas dataframe that looks like this:
experiment replicate count fdr
0 a 1 10 0.01
1 a 1 8 0
2 a 1 9 0
I would like to group by experiment and replicate and keep the row that has the minimum fdr, but in cases where there are multiple rows with the same minimum fdr, use the row with the maximum count.
So my expected output would be
experiment replicate count fdr
2 a 1 9 0
From reading other posts I can do this based on a single condition with something like:
df.groupby(['experiment', 'replicate']).fdr.transform(min)
but I can't figure out how to do it with two conditions. I believe I need apply instead of transform, but I'm still struggling with getting something to work.
You may need to sort your dataframe in a very specific way. If the last False in the ascending parameter were changed to True, you would get a different answer, so make sure it is sorted exactly like this.
Then you can use your groupby with idxmin()[0] to return the index of the minimum value ([0] gets rid of the Series index so you just get the raw value), and then filter the dataframe by that index.
df = df.sort_values(['experiment', 'replicate', 'fdr', 'count'],
                    ascending=[True, True, True, False])
df[df.index == df.groupby(['experiment', 'replicate']).fdr.idxmin()[0]]
# Per @wwii's comment, a slightly cleaner and arguably more idiomatic way
df.loc[df.groupby(['experiment', 'replicate']).fdr.idxmin(),:]
Out[1]:
experiment replicate count fdr
2 a 1 9 0.0
You could first get the group minimum, compare it with each row, then get the index of the row with the maximum count, and finally filter for that row:
cond1 = df.groupby(["experiment", "replicate"]).fdr.transform("min")
row_with_max_count = df.loc[df.fdr.eq(cond1), "count"].idxmax()
df.loc[[row_with_max_count]]
experiment replicate count fdr
2 a 1 9 0.0
import pandas as pd
data = {'experiment': ['a', 'a', 'a'],
        'replicate': [1, 1, 1],
        'count': [10, 8, 9],
        'fdr': [0.01, 0, 0]}
df = pd.DataFrame(data)
gives
experiment replicate count fdr
0 a 1 10 0.01
1 a 1 8 0.00
2 a 1 9 0.00
df.groupby(['experiment', 'replicate']).min()
count fdr
experiment replicate
a 1 8 0.0
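Note that groupby(...).min() takes column-wise minima, so the count of 8 and the fdr of 0.0 above can come from different rows. A sketch (not from the answers above) that keeps whole rows and breaks ties by the larger count in a single pass:
best = (df.sort_values(['fdr', 'count'], ascending=[True, False])
          .drop_duplicates(subset=['experiment', 'replicate'], keep='first'))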

Pandas Python - dividing a column by 100 (then rounding to 2 d.p.)

I have been manipulating some data frames, but unfortunately I have two percentage columns, one in the format '61.72' and the other in the format '0.62'.
I want to divide the column with percentages in the '61.72' format by 100 and then round it to 2 d.p. so it is consistent with the rest of the data frame.
Is there an easy way of doing this?
My data frame has two columns, one called 'A' and the other 'B', I want to format 'B'.
Many thanks!
You can use div with round:
df = pd.DataFrame({'A':[61.75, 10.25], 'B':[0.62, 0.45]})
print (df)
A B
0 61.75 0.62
1 10.25 0.45
df['A'] = df['A'].div(100).round(2)
#same as
#df['A'] = (df['A'] / 100).round(2)
print (df)
A B
0 0.62 0.62
1 0.10 0.45
This question has already been answered, but here is another solution, which is significantly faster and more standard.
df = pd.DataFrame({'x':[10, 3.50], 'y':[30.1, 50.8]})
print (df)
>> x y
0 10.0 30.1
1 3.5 50.8
df = df.loc[:].div(100).round(2)
print (df)
>> x y
0 0.10 0.30
1 0.03 0.50
Why prefer this solution? Because assigning with df['A'] on a slice of a dataframe can trigger the warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead."
For more background, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
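For completeness, a minimal self-contained sketch of the .loc assignment form that the warning recommends, using the small frame from the first answer and rescaling only column 'A':
import pandas as pd

df = pd.DataFrame({'A': [61.75, 10.25], 'B': [0.62, 0.45]})
# explicit row and column indexers, as the warning suggests; only 'A' is rescaled
df.loc[:, 'A'] = df['A'].div(100).round(2)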

Find mean and corr of 10,000 columns in a PySpark DataFrame

I have a DataFrame with 10K columns and 70 million rows. I want to calculate the mean and corr of the 10K columns. I wrote the code below, but it won't work due to the 64KB generated-code size issue (https://issues.apache.org/jira/browse/SPARK-16845).
Data:
region dept week sal  val1 val2 val3 ... val10000
US     CS   1    1    2    1    1    ... 2
US     CS   2    1.5  2    3    1    ... 2
US     CS   3    1    2    2    2.1  ... 2
US     ELE  1    1.1  2    2    2.1  ... 2
US     ELE  2    2.1  2    2    2.1  ... 2
US     ELE  3    1    2    1    2    ... 2
UE     CS   1    2    2    1    2    ... 2
Code:
aggList = [func.mean(col) for col in df.columns] #exclude keys
df2= df.groupBy('region', 'dept').agg(*aggList)
Code 2:
aggList = [func.corr('sal', col).alias(col) for col in df.columns] #exclude keys
df2 = df.groupBy('region', 'dept', 'week').agg(*aggList)
This fails. Is there any alternative way to work around this bug? Has anyone tried a DataFrame with 10K columns? Are there any suggestions for improving performance?
We also ran into the 64KB issue, but in a where clause, which is filed under another bug report. What we used as a workaround is simply to do the operations/transformations in several steps.
In your case, this means not doing all the aggregations in one step. Instead, loop over the relevant columns in an outer operation (sketched in code below):
Use select to create a temporary dataframe that contains just the columns you need for the operation.
Use groupBy and agg like you did, except not for the full list of aggregations but just for one (or two; you could combine the mean and corr).
Once you have references to all the temporary dataframes, join them back to a result df on the grouping keys (withColumn alone cannot pull columns from another dataframe).
Due to Spark's lazy evaluation of the DAG, this is of course slower than doing it in one operation, but it should still evaluate the whole analysis in one run.
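Not from the original answer: a minimal sketch of that chunked loop, assuming df is the dataframe above, pyspark.sql.functions imported as func, and an arbitrary chunk size of 500 columns to stay under the codegen limit; for brevity, both the means and the correlations are grouped by the same keys here.
from functools import reduce
import pyspark.sql.functions as func

key_cols = ['region', 'dept', 'week']
val_cols = [c for c in df.columns if c not in key_cols + ['sal']]

chunk_size = 500  # tune so each generated aggregation stays well under 64KB
partials = []
for i in range(0, len(val_cols), chunk_size):
    chunk = val_cols[i:i + chunk_size]
    agg_list = ([func.mean(c).alias(c + '_mean') for c in chunk]
                + [func.corr('sal', c).alias(c + '_corr') for c in chunk])
    # temporary dataframe with only the columns needed for this chunk
    partial = (df.select(key_cols + ['sal'] + chunk)
                 .groupBy(key_cols)
                 .agg(*agg_list))
    partials.append(partial)

# join the per-chunk aggregates back together on the grouping keys
result = reduce(lambda left, right: left.join(right, on=key_cols), partials)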

Splitting array values in dataframe into new dataframe - python

I have a pandas dataframe with a variable that is an array of arrays. I would like to create a new dataframe from this variable.
My current dataframe 'fruits' looks like this...
Id Name Color price_trend
1 apple red [['1420848000','1.25'],['1440201600','1.35'],['1443830400','1.52']]
2 lemon yellow [['1403740800','0.32'],['1422057600','0.25']]
What I would like is a new dataframe from the 'price_trend' column that looks like this...
Id date price
1 1420848000 1.25
1 1440201600 1.35
1 1443830400 1.52
2 1403740800 0.32
2 1422057600 0.25
Thanks for the advice!
A groupby+apply should do the trick.
def f(group):
    # take the first (and only) row of the group
    row = group.iloc[0]
    ids = [row['Id'] for v in row['price_trend']]
    dates = [v[0] for v in row['price_trend']]
    prices = [v[1] for v in row['price_trend']]
    return pd.DataFrame({'Id': ids, 'date': dates, 'price': prices})
In[7]: df.groupby('Id', group_keys=False).apply(f)
Out[7]:
Id date price
0 1 1420848000 1.25
1 1 1440201600 1.35
2 1 1443830400 1.52
0 2 1403740800 0.32
1 2 1422057600 0.25
Edit:
To filter out bad data (for instance, a price_trend column having value [['None']]), one option is to use pandas boolean indexing.
criterion = df['price_trend'].map(lambda x: len(x) > 0 and all(len(pair) == 2 for pair in x))
df[criterion].groupby('Id', group_keys=False).apply(f)
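On pandas 0.25 or newer, explode offers a shorter alternative to the groupby/apply; a sketch, assuming the same fruits dataframe and well-formed [date, price] pairs:
out = (fruits[['Id', 'price_trend']]
       .explode('price_trend')
       .reset_index(drop=True))
# each exploded cell is a [date, price] pair; .str indexing works on lists too
out['date'] = out['price_trend'].str[0]
out['price'] = out['price_trend'].str[1].astype(float)
out = out.drop(columns='price_trend')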
