Pandas Python - dividing a column by 100 (then rounding to 2 d.p.) - python

I have been manipulating some data frames, but unfortunately I have two percentage columns, one in the format '61.72' and the other '0.62'.
I want to divide the column with the percentages in the '61.72' format by 100 and then round it to 2 d.p. so it is consistent with the rest of the data frame.
Is there an easy way of doing this?
My data frame has two columns, one called 'A' and the other 'B'; I want to format 'B'.
Many thanks!

You can use div with round:
df = pd.DataFrame({'A':[61.75, 10.25], 'B':[0.62, 0.45]})
print (df)
       A     B
0  61.75  0.62
1  10.25  0.45
df['A'] = df['A'].div(100).round(2)
#same as
#df['A'] = (df['A'] / 100).round(2)
print (df)
      A     B
0  0.62  0.62
1  0.10  0.45

This question has already been answered, but here is another solution that operates on the whole DataFrame at once.
df = pd.DataFrame({'x':[10, 3.50], 'y':[30.1, 50.8]})
print (df)
      x     y
0  10.0  30.1
1   3.5  50.8
df = df.loc[:].div(100).round(2)
print (df)
      x     y
0  0.10  0.30
1  0.04  0.51
Why prefer this solution? Because chained assignment can trigger the SettingWithCopyWarning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead."
For more background, see the pandas docs on returning a view versus a copy: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
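For the single-column case in the question, the two answers can be combined; a minimal sketch (column names follow the first answer's example):
import pandas as pd

df = pd.DataFrame({'A': [61.75, 10.25], 'B': [0.62, 0.45]})

# divide only the '61.72'-style column by 100 and round to 2 d.p.;
# assigning through .loc sidesteps the chained-assignment warning quoted above
df.loc[:, 'A'] = df['A'].div(100).round(2)
print(df)
#       A     B
# 0  0.62  0.62
# 1  0.10  0.45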

Related

Add new column with the sum of all previous values (python)

I am very new to python and trying to complete an assignment for uni. I've already tried googling the issue (there may well be an existing solution), but I could not find one for my problem.
I have a dataframe with values and a timestamp. It looks like this:
created_at   delta
2020-01-01    1.45
2020-01-02    0.12
2020-01-03    1.01
...            ...
I want to create a new column 'sum' which holds the running total of all previous values, like this:
created_at   delta    sum
2020-01-01    1.45   1.45
2020-01-02    0.12   1.57
2020-01-03    1.01   2.58
...            ...    ...
I want to define a method that I can use on different files (the data is spread across multiple files).
I have tried this, but it doesn't work:
def sum_(data_index):
    df_sum = delta_(data_index)  # getting the data
    y = len(df_sum)
    for x in range(0, y):
        df_sum['sum'].iloc[[0]] = df_sum['delta'].iloc[[0]]
        df_sum['sum'].iloc[[x]] = df_sum['sum'].iloc[[x-1]] + df_sum['delta'].iloc[[x]]
    return df_sum
I am very thankful for any help.
Kind regards
Try cumsum():
df['sum'] = df['delta'].cumsum()
Use cumsum. A simple example:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5]})
df['y'] = df['x'].cumsum()
print(df)
Output:
   x   y
0  1   1
1  2   3
2  3   6
3  4  10
4  5  15
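Since the question asks for a method that can be reused across several files, the cumsum call can be wrapped in a small helper; a minimal sketch (the function name add_running_total and the sample data are hypothetical, the column names follow the question):
import pandas as pd

def add_running_total(df):
    # return a copy with a 'sum' column holding the running total of 'delta'
    out = df.copy()
    out['sum'] = out['delta'].cumsum()
    return out

df = pd.DataFrame({'created_at': ['2020-01-01', '2020-01-02', '2020-01-03'],
                   'delta': [1.45, 0.12, 1.01]})
print(add_running_total(df))
#    created_at  delta   sum
# 0  2020-01-01   1.45  1.45
# 1  2020-01-02   0.12  1.57
# 2  2020-01-03   1.01  2.58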

How to locate and replace values in dataframe based on some criteria

I would like to locate all places where the value in Col2 changes (for example from A to C) and then modify the corresponding value in Col1 (the row where the change happens, so for A -> C it is the value in the same row as C), replacing it with the previous value plus half the difference between the current and previous value (in this example 1 + (1.5 - 1)/2 = 1.25).
The output table is the result of replacing all such occurrences in the whole table.
How can I achieve that?
Col1  Col2
1     A
1.5   C
2.0   A
2.5   A
3.0   D
3.5   D

OUTPUT:

Col1  Col2
1     A
1.25  C
1.75  A
2.5   A
2.75  D
3.5   D
Use np.where and a Series holding the values of your formula:
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
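For completeness, here is a self-contained run of the two lines above on the question's sample data (a minimal sketch; the values match the OUTPUT table in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1.5, 2.0, 2.5, 3.0, 3.5],
                   'Col2': ['A', 'C', 'A', 'A', 'D', 'D']})

# value halfway between the previous and current Col1 value
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
# apply it only where Col2 differs from the previous row
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
print(df)
#    Col1 Col2
# 0  1.00    A
# 1  1.25    C
# 2  1.75    A
# 3  2.50    A
# 4  2.75    D
# 5  3.50    D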

Pandas groupby and then sort based on groups

I have a dataset of news articles and their associated concepts and sentiment (NLP-detected) which I want to group by two fields: the concept and the source. A simplified version follows:
>>> df = pandas.DataFrame({'concept_label': [1, 1, 2, 2, 3, 1, 1, 1],
...                        'source_uri': ['A', 'B', 'A', 'A', 'A', 'C', 'C', 'C'],
...                        'sentiment_article': [0.05, 0.15, -0.3, -0.2, -0.5, -0.6, -0.3, -0.4]})
concept_label  source_uri  sentiment_article
1              A            0.05
1              B            0.15
2              A           -0.3
2              A           -0.2
3              A           -0.5
1              C           -0.6
1              C           -0.3
1              C           -0.4
So, for a concept such as "Coronavirus", I basically want to know how often each news outlet writes about the topic and what the mean sentiment of those articles is. Grouped that way, the above df would look like this:
                          mean  count
concept_label source_uri
3             A          -0.50      1
2             A          -0.25      2
1             A           0.05      1
1             B           0.15      1
1             C          -0.43      3
I am able to do the grouping with the following code (df is the pandas dataframe I'm using, concept_label is the concept, and source_uri is the news outlet):
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count'])
This works just fine and gives me the values I need; however, I want the groups with the highest aggregate "count" to be at the top. The way I tried to do that is by changing it to the following:
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count']).sort_values(by=['count'], ascending=False)
However, even though this sorts by the count, it breaks up the groups again. My result currently looks like this:

                          mean  count
concept_label source_uri
3             A          -0.50      1
1             A           0.05      1
1             B           0.15      1
2             A          -0.25      2
1             C          -0.43      3
I don't believe this is the nicest answer, but I found a way to do it.
I grouped the full list first and saved the total count per concept_label as a separate column that I then merged into the existing dataframe. This way I can sort primarily on that column and secondarily on the actual count.
# adding a count column to the existing table
df_grouped = df.groupby(['concept_label'])['concept_label'].agg(['count']).sort_values(by=['count'])
df_grouped.rename(columns={'count': 'concept_count'}, inplace=True)
df_count = pd.merge(df, df_grouped, left_on='concept_label', right_on='concept_label')

# grouping and sorting
df_sentiment = (df_count.groupby(['concept_label', 'source_uri', 'concept_count'])['sentiment_article']
                        .agg(['mean', 'count'])
                        .sort_values(by=['concept_count', 'count'], ascending=False))
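An alternative sketch that avoids the intermediate merge by computing the per-concept total with groupby().transform (assuming the same df and pd import as above); it uses the same grouping keys and sort order, so the result is the same:
df_sentiment = (df.assign(concept_count=df.groupby('concept_label')['concept_label'].transform('count'))
                  .groupby(['concept_label', 'source_uri', 'concept_count'])['sentiment_article']
                  .agg(['mean', 'count'])
                  .sort_values(by=['concept_count', 'count'], ascending=False))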

Find index of first row closest to value in pandas DataFrame

So I have a dataframe containing multiple columns. For each column, I would like to get the index of the first row that is nearly equal to a user specified number (e.g. within 0.05 of desired number). The dataframe looks kinda like this:
ix  col1  col2  col3
0    nan  0.20  1.04
1   0.98   nan  1.50
2   1.70  1.03  1.91
3   1.02  1.42  0.97
Say I want the first row that is nearly equal to 1.0; I would expect the result to be:
index 1 for col1 (not index 3 even though they are mathematically equally close to 1.0)
index 2 for col2
index 0 for col3 (not index 3 even though 0.97 is closer to 1 than 1.04)
I've tried an approach that makes use of argsort():
df.iloc[(df.col1-1.0).abs().argsort()[:1]]
This would, according to other topics, give me the index of the row in col1 with the value closest to 1.0. However, it returns only a dataframe full of nans. I would also imagine this method does not give the first value close to 1 it encounters per column, but rather the value that is closest to 1.
Can anyone help me with this?
Use DataFrame.sub for the difference, convert to absolute values with abs, compare with lt (<), and finally get the index of the first True value in each column with DataFrame.idxmax:
a = df.sub(1).abs().lt(0.05).idxmax()
print (a)
col1 1
col2 2
col3 0
dtype: int64
For a more general solution that also works when the boolean mask has no True in a column (no value is within tolerance), append a row of Trues whose name is NaN, so idxmax returns NaN for such columns:
print (df)
     col1  col2  col3
ix
0     NaN  0.20  1.07
1    0.98   NaN  1.50
2    1.70  1.03  1.91
3    1.02  1.42  0.87
s = pd.Series([True] * len(df.columns), index=df.columns, name=np.nan)
a = df.sub(1).abs().lt(0.05).append(s).idxmax()
print (a)
col1 1.0
col2 2.0
col3 NaN
dtype: float64
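Note that DataFrame.append was deprecated and then removed in pandas 2.0; on recent versions the same trick can be written with pd.concat instead (a sketch, reusing the df and s defined above):
a = pd.concat([df.sub(1).abs().lt(0.05), s.to_frame().T]).idxmax()
print (a)   # same output as above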
Suppose you have some tolerance value tol for the nearly-match threshold. You can create a mask dataframe for values within the threshold and use first_valid_index() on each column to get the index of the first match occurrence.
tol = 0.05
mask = df[(df - 1).abs() < tol]
for col in df:
    print(col, mask[col].first_valid_index())
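A self-contained version of the above, using the sample data from the question (a minimal sketch; the ix column is left as the default index here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 0.98, 1.70, 1.02],
                   'col2': [0.20, np.nan, 1.03, 1.42],
                   'col3': [1.04, 1.50, 1.91, 0.97]})

tol = 0.05
mask = df[(df - 1).abs() < tol]   # keep only values within tol of 1.0, NaN elsewhere
for col in df:
    print(col, mask[col].first_valid_index())
# col1 1
# col2 2
# col3 0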

Splitting array values in dataframe into new dataframe - python

I have a pandas dataframe with a variable that is an array of arrays. I would like to create a new dataframe from this variable.
My current dataframe 'fruits' looks like this...
Id  Name   Color   price_trend
1   apple  red     [['1420848000','1.25'], ['1440201600','1.35'], ['1443830400','1.52']]
2   lemon  yellow  [['1403740800','0.32'], ['1422057600','0.25']]
What I would like is a new dataframe from the 'price_trend' column that looks like this...
Id  date        price
1   1420848000  1.25
1   1440201600  1.35
1   1443830400  1.52
2   1403740800  0.32
2   1422057600  0.25
Thanks for the advice!
A groupby+apply should do the trick.
import pandas as pd

def f(group):
    row = group.iloc[0]  # .irow(0) was removed from pandas; use .iloc[0]
    ids = [row['Id'] for v in row['price_trend']]
    dates = [v[0] for v in row['price_trend']]
    prices = [v[1] for v in row['price_trend']]
    return pd.DataFrame({'Id': ids, 'date': dates, 'price': prices})
In [7]: df.groupby('Id', group_keys=False).apply(f)
Out[7]:
   Id        date price
0   1  1420848000  1.25
1   1  1440201600  1.35
2   1  1443830400  1.52
0   2  1403740800  0.32
1   2  1422057600  0.25
Edit:
To filter out bad data (for instance, a price_trend column having value [['None']]), one option is to use pandas boolean indexing.
criterion = df['price_trend'].map(lambda x: len(x) > 0 and all(len(pair) == 2 for pair in x))
df[criterion].groupby('Id', group_keys=False).apply(f)
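On pandas 0.25 or newer, the same reshaping can also be done with explode, which avoids the groupby/apply entirely; a sketch, assuming the original frame is named fruits as in the question:
# one row per [date, price] pair, keeping the matching Id
trend = fruits[['Id', 'price_trend']].explode('price_trend').reset_index(drop=True)
# split each pair into its own columns
pairs = pd.DataFrame(trend['price_trend'].tolist(), columns=['date', 'price'])
out = pd.concat([trend['Id'], pairs], axis=1)
print(out)
#    Id        date price
# 0   1  1420848000  1.25
# 1   1  1440201600  1.35
# 2   1  1443830400  1.52
# 3   2  1403740800  0.32
# 4   2  1422057600  0.25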
