I am very new to Python and trying to complete an assignment for uni. I've already tried googling the issue (and there may already be a solution out there), but I could not find one for my problem.
I have a dataframe with values and a timestamp. It looks like this:
created_at    delta
2020-01-01    1.45
2020-01-02    0.12
2020-01-03    1.01
...           ...
I want to create a new column 'sum' which holds the running total of all the previous values, like this:
created_at    delta    sum
2020-01-01    1.45     1.45
2020-01-02    0.12     1.57
2020-01-03    1.01     2.58
...           ...      ...
I want to define a method that I can use on different files (the data is spread across multiple files).
I have tried this, but it doesn't work:
def sum_(data_index):
    df_sum = delta_(data_index)  # getting the data
    y = len(df_sum)
    for x in range(0, y):
        df_sum['sum'].iloc[[0]] = df_sum['delta'].iloc[[0]]
        df_sum['sum'].iloc[[x]] = df_sum['sum'].iloc[[x-1]] + df_sum['delta'].iloc[[x]]
    return df_sum
For any help, I am very thankful.
Kind regards
Try cumsum():
df['sum'] = df['delta'].cumsum()
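If you want to wrap this in a method you can reuse on your different files, a minimal sketch (assuming your existing delta_(data_index) helper returns the dataframe for one file, as in your code) could look like this:
def sum_(data_index):
    df_sum = delta_(data_index)               # load the data for this file
    df_sum['sum'] = df_sum['delta'].cumsum()  # running total of all previous deltas
    return df_sum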
Use cumsum. A simple example:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5]})
df['y'] = df['x'].cumsum()
print(df)
Output:
x y
0 1 1
1 2 3
2 3 6
3 4 10
4 5 15
Related
Hello, I have this DataFrame (sample):
User    Timestamp     Production
A       2020-01-01    5
A       2020-06-01    7
A       2020-12-01    15
B       2020-01-01    2
B       2020-06-01    7
B       2020-12-01    9
So, I need to calculate the difference between consecutive Production values for each user and append it to the DataFrame; for the first period of a user, the resulting value should simply be the Production itself.
The resulting table would be as follows (production column omitted due to problem in table editor):
User    Timestamp     Difference
A       2020-01-01    5
A       2020-06-01    2
A       2020-12-01    8
B       2020-01-01    2
B       2020-06-01    5
B       2020-12-01    2
So I tried with the .diff() function, but obviously it doesn't recognize when the user changes. I then tried with a groupby() applied on the User column and then computed diff(), but I get the same problem:
df['Difference'] = df.groupby('User')['Production'].diff()
Can someone help me out?
Thanks!
EDIT:
Made a step forward, but still trying to figure it out.
I wrote this:
grouped = df.groupby('User')
diff = lambda x: x['Production'].shift(+1) - x['Production']
df['diff'] = grouped.apply(diff).reset_index(0, drop=True).fillna(df['Production'])
This does the difference the way I want it, but it still messes up when the User identifier changes.
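For what it's worth, a minimal sketch combining groupby, diff and fillna on the sample data above produces the Difference column described in the question:
import pandas as pd

df = pd.DataFrame({'User': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Timestamp': ['2020-01-01', '2020-06-01', '2020-12-01'] * 2,
                   'Production': [5, 7, 15, 2, 7, 9]})

# difference to the previous row within each User; the first row of each
# User has no previous value, so fall back to the Production itself
df['Difference'] = df.groupby('User')['Production'].diff().fillna(df['Production'])
print(df)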
I have a dataset of news articles and their associated concepts and sentiment (NLP detected) which I want to group by 2 fields: the Concept and the Source. A simplified version follows:
>>> import pandas
>>> df = pandas.DataFrame({'concept_label': [1,1,2,2,3,1,1,1],
...                        'source_uri': ['A','B','A','A','A','C','C','C'],
...                        'sentiment_article': [0.05,0.15,-0.3,-0.2,-0.5,-0.6,-0.3,-0.4]})
concept_label source_uri sentiment_article
1 A 0.05
1 B 0.15
2 A -0.3
2 A -0.2
3 A -0.5
1 C -0.6
1 C -0.3
1 C -0.4
So, for the concept "Coronavirus", I basically want to know how often each news outlet writes about the topic and what the mean sentiment of the articles is. The above df would then look like this:
mean count
concept_label source_uri
3 A -0.50 1
2 A -0.25 2
1 A 0.050 1
1 B 0.150 1
1 C -0.43 3
I am able to do the grouping with the following code (df is the pandas dataframe I'm using, concept_label is the concept, and source_uri is the news outlet):
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count'])
This works just fine and gives me the values I need, however I want the groups with the highest aggregate number of "count" to be at the top. The way I tried to do that is by changing it to the following:
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count']).sort_values(by=['count'], ascending=False)
However even though this sorts by the count, it breaks up the groups again. My result currently looks like this:
mean count
concept_label source_uri
3 A -0.50 1
1 A 0.050 1
1 B 0.150 1
2 A -0.25 2
1 C -0.43 3
I don't believe this is the nicest answer, but I found a way to do it.
I grouped the total list first and saved the total count per concept_label as a variable that I then merged with the existing dataframe. This way I can sort on that column first and on the actual count second.
#adding count column to existing table
df_grouped = df.groupby(['concept_label'])['concept_label'].agg(['count']).sort_values(by=['count'])
df_grouped.rename(columns={'count':'concept_count'}, inplace=True)
df_count = pd.merge(df, df_grouped, left_on='concept_label', right_on='concept_label')
# sorting
df_sentiment = (df_count.groupby(['concept_label', 'source_uri', 'concept_count'])['sentiment_article']
                        .agg(['mean', 'count'])
                        .sort_values(by=['concept_count', 'count'], ascending=False))
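For what it's worth, a sketch of an equivalent approach that avoids the extra merge by attaching the per-concept count with transform (same column names as above):
df['concept_count'] = df.groupby('concept_label')['concept_label'].transform('count')
df_sentiment = (df.groupby(['concept_label', 'source_uri', 'concept_count'])['sentiment_article']
                  .agg(['mean', 'count'])
                  .sort_values(by=['concept_count', 'count'], ascending=False))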
I have the following dataframe:
Quantity_Limit Cost Wholesaler_Code
2 9.2 1
2 9.4 1
2 7.1 2
4 10.2 1
4 4.1 2
4 2.1 3
And I would like to create the following dataframe, with only the Wholesalers that offer the minimum Cost for each Quantity_Limit, without using a for loop:
Quantity_Limit Cost Wholesaler_Code
2 7.1 2
4 2.1 3
I tried with:
df.groupby(["Quantity_Limit", "Wholesaler_Code"], as_index = False).agg({"Cost": "min"})
but I don't get the desired result.
Just sort by Quantity_Limit and Cost, then drop_duplicates:
df.sort_values(['Quantity_Limit', 'Cost']).drop_duplicates(subset=['Quantity_Limit'])
Out[1121]:
Quantity_Limit Cost Wholesaler_Code
2 2 7.1 2
5 4 2.1 3
You can use transform to create a column with the minimum values and filter based on those.
df["min_cost"] = df.groupby("Quantity_Limit")["Cost"].transform("min")
df[df["Cost"] == df["min_cost"]]
You can also groupby and join the result df to the original df to get the leftover column:
df2 = df.groupby(['Quantity_Limit'])['Cost'].min().reset_index()
df2 = pd.merge(df2, df, on = ['Quantity_Limit', 'Cost'], how = 'left')
Output:
Quantity_Limit Cost Wholesaler_Code
0 2 7.1 2
1 4 2.1 3
import pandas as pd
# Raw data
data = [[2, 9.2, 1], [2, 9.4, 1], [2, 7.1, 2], [4, 10.2, 1], [4, 4.1, 2], [4, 2.1, 3]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Quantity_Limit', 'Cost', 'Wholesaler_Code'])
# Group by "Quantity_Limit" to find the minimum "Cost"; transform broadcasts it back to every row as a new column min_cost
df["min_cost"] = df.groupby(["Quantity_Limit"])["Cost"].transform("min")
Now we keep only the rows whose Cost equals the group minimum:
df1 = df[df["Cost"] == df["min_cost"]]
And you will get your desired output.
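With the data above, the remaining rows should look roughly like this:
   Quantity_Limit  Cost  Wholesaler_Code  min_cost
2               2   7.1                2       7.1
5               4   2.1                3       2.1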
I have a DataFrame that has dates, assets, and then price/volume data. I'm trying to pull in data from 7 days ago, but the issue is that I can't use shift() because my table has missing dates in it.
date cusip price price_7daysago
1/1/2017 a 1
1/1/2017 b 2
1/2/2017 a 1.2
1/2/2017 b 2.3
1/8/2017 a 1.1 1
1/8/2017 b 2.2 2
I've tried creating a lambda function to try to use loc and timedelta to create this shifting, but I was only able to output empty numpy arrays:
def row_delta(x, df, days, colname):
    if datetime.strptime(x['recorddate'], '%Y%m%d') - timedelta(days) in [datetime.strptime(x, '%Y%m%d') for x in df['recorddate'].unique().tolist()]:
        return df.loc[(df['recorddate_date'] == df['recorddate_date'] - timedelta(days)) & (df['cusip'] == x['cusip']), colname]
    else:
        return 'nothing'
I also thought of doing something similar to this in order to fill in missing dates, but my issue is that I have multiple indexes, the dates and the cusips so I can't just reindex on this.
Merge the DataFrame with itself while adding 7 days to the date column of the right frame. Use the suffixes argument to name the columns appropriately.
import pandas as pd
df['date'] = pd.to_datetime(df.date)
df.merge(df.assign(date=df.date + pd.Timedelta(days=7)),
         on=['date', 'cusip'],
         how='left', suffixes=['', '_7daysago'])
Output: df
date cusip price price_7daysago
0 2017-01-01 a 1.0 NaN
1 2017-01-01 b 2.0 NaN
2 2017-01-02 a 1.2 NaN
3 2017-01-02 b 2.3 NaN
4 2017-01-08 a 1.1 1.0
5 2017-01-08 b 2.2 2.0
You can also set date and cusip as the index and use unstack and shift together. Pass a freq so the shift is by 7 calendar days rather than 7 rows:
shifted = df.set_index(["date", "cusip"]).unstack().shift(7, freq="D").stack()
then simply merge shifted with your original df.
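For completeness, a sketch of that idea end to end (assuming df has the columns date, cusip and price, and that date has been converted with pd.to_datetime):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])

# freq="D" moves the date labels forward by 7 calendar days instead of
# shifting the values by 7 rows, which matters because some dates are missing
shifted = df.set_index(['date', 'cusip']).unstack().shift(7, freq='D').stack()

# bring the shifted prices back alongside the original rows
result = df.merge(shifted.reset_index().rename(columns={'price': 'price_7daysago'}),
                  on=['date', 'cusip'], how='left')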
I'm new to pandas and trying to manage some dataframe operations where I have 4 columns on a multi-index dataframe, and I need an extra column whose value is the value in one row divided by the value in a specific row.
In my example below, for each entry I would like the new column "Agg" to be the column "Values" for each Type (1, 2, 3) divided by the "Values" for Calc.
Date Values Agg
2016-01-01 Type 1 17 1.7
Type 2 23 2.3
Type 3 11 1.1
Calc 10 1.0
2016-01-02 Type 1 25 0.25
Type 2 39 0.39
Type 3 34 0.34
Calc 100 1.00
2016-01-03 Type 1 20 1.00
Type 2 9 0.45
Type 3 12 0.60
Calc 20 1.00
In my actual code I have a groupby on "Date" and other indexes; these change depending on the results from a query to the db.
Thanks in advance!
The code below works. I spent too much time writing it, so I have to leave it at that. Let me know if you need explanations!
def func(df1):
    # the date of this group (level 0 of the MultiIndex)
    idx = df1.index.get_level_values(0)[0]
    # drop the date level so we can index by Type directly
    df1 = df1.loc[idx]
    # divide every Values entry by the Values of the 'Calc' row
    return (df1['Values'] / df1.loc['Calc']['Values']).to_frame()

df.groupby(level=0).apply(func)
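As a possible alternative without the custom function (a sketch assuming the same index layout, with the date in level 0 and the type labels, including 'Calc', in level 1):
# the Calc value for each date
calc = df['Values'].xs('Calc', level=1)

# divide each row's Values by the Calc value of its own date
df['Agg'] = df['Values'] / calc.reindex(df.index.get_level_values(0)).values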