I have a pandas dataframe with a variable that is an array of arrays. I would like to create a new dataframe from this variable.
My current dataframe 'fruits' looks like this...
Id Name Color price_trend
1 apple red [['1420848000','1.25'],['1440201600','1.35'],['1443830400','1.52']]
2 lemon yellow [['1403740800','0.32'],['1422057600','0.25']]
What I would like is a new dataframe from the 'price_trend' column that looks like this...
Id date price
1 1420848000 1.25
1 1440201600 1.35
1 1443830400 1.52
2 1403740800 0.32
2 1422057600 0.25
Thanks for the advice!
A groupby+apply should do the trick.
def f(group):
    row = group.iloc[0]  # irow() was removed from pandas; iloc[0] takes the group's first row
    ids = [row['Id'] for v in row['price_trend']]
    dates = [v[0] for v in row['price_trend']]
    prices = [v[1] for v in row['price_trend']]
    return pd.DataFrame({'Id': ids, 'date': dates, 'price': prices})
In[7]: df.groupby('Id', group_keys=False).apply(f)
Out[7]:
Id date price
0 1 1420848000 1.25
1 1 1440201600 1.35
2 1 1443830400 1.52
0 2 1403740800 0.32
1 2 1422057600 0.25
Edit:
To filter out bad data (for instance, a price_trend column having value [['None']]), one option is to use pandas boolean indexing.
criterion = df['price_trend'].map(lambda x: len(x) > 0 and all(len(pair) == 2 for pair in x))
df[criterion].groupby('Id', group_keys=False).apply(f)
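On recent pandas versions (where irow no longer exists) the same reshaping can also be done without groupby, by flattening the pairs directly; a minimal sketch, assuming the fruits dataframe shown above:
import pandas as pd

# one output row per (date, price) pair, carrying the Id along
rows = [
    (row.Id, date, price)
    for row in fruits.itertuples(index=False)
    for date, price in row.price_trend
]
trend = pd.DataFrame(rows, columns=['Id', 'date', 'price'])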
I have a pandas dataframe below:
df
name value1 value2 otherstuff1 otherstuff2
0 Jack 1 1 1.19 2.39
1 Jack 1 2 1.19 2.39
2 Luke 0 1 1.08 1.08
3 Mark 0 1 3.45 3.45
4 Luke 1 0 1.08 1.08
Same name will have the same value for otherstuff1 and otherstuff2.
I'm trying to group by column name and sum both the value1 and value2 columns (not value1 plus value2, but each column summed individually).
Expecting to get result below:
newdf
name value1 value2 otherstuff1 otherstuff2
0 Jack 2 3 1.19 2.39
1 Luke 1 1 1.08 1.08
2 Mark 0 1 3.45 3.45
I've tried
newdf = df.groupby(['name'], as_index=False).sum()
which groups by name and sums up both value1 and value2 columns correctly, but ends up dropping columns otherstuff1 and otherstuff2.
You should specify what pandas must do with the other columns. In your case, I think you want to keep one value per group, regardless of its position within the group.
This can be done with agg on the groupby object: agg accepts a dict that maps each column to the operation to perform on it.
df.groupby(['name'], as_index=False).agg({'value1': 'sum', 'value2': 'sum', 'otherstuff1': 'first', 'otherstuff2': 'first'})
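On newer pandas versions (0.25+) the same mapping can also be written with named aggregation, which lets you rename the outputs at the same time; a minimal sketch using the question's column names:
newdf = df.groupby('name', as_index=False).agg(
    value1=('value1', 'sum'),
    value2=('value2', 'sum'),
    otherstuff1=('otherstuff1', 'first'),
    otherstuff2=('otherstuff2', 'first'),
)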
Something like this? (Assuming the same name always has the same otherstuff1 and otherstuff2, as stated.)
df.groupby(['name','otherstuff1','otherstuff2'],as_index=False).sum()
Out[121]:
name otherstuff1 otherstuff2 value1 value2
0 Jack 1.19 2.39 2 3
1 Luke 1.08 1.08 1 1
2 Mark 3.45 3.45 0 1
The key in the answer above is actually the as_index=False, otherwise all the columns in the list get used in the index.
p_summ = p.groupby(attributes_list, as_index=False).agg({'AMT': sum})
These solutions are great, but when you have too many columns you do not want to type all of the column names. So here is what I came up with:
column_map = {col: "first" for col in df.columns}
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: set(x) # it can also be a function or lambda
Now you can simply do:
df.groupby(["col_to_group"], as_index=False).agg(column_map)
I am very new to Python and trying to complete an assignment for uni. I've already tried googling the issue, but could not find a solution to my problem.
I have a dataframe with values and a timestamp. It looks like this:
created_at    delta
2020-01-01    1.45
2020-01-02    0.12
2020-01-03    1.01
...           ...
I want to create a new column 'sum' that holds the running total of all previous delta values, like this:
created_at    delta    sum
2020-01-01    1.45     1.45
2020-01-02    0.12     1.57
2020-01-03    1.01     2.58
...           ...      ...
I want to define a method that I can use on different files (the data is spread across multiple files).
I have tried this but it doesn't work
def sum_(data_index):
    df_sum = delta_(data_index)  # getting the data
    y = len(df_sum)
    for x in range(0, y):
        df_sum['sum'].iloc[[0]] = df_sum['delta'].iloc[[0]]
        df_sum['sum'].iloc[[x]] = df_sum['sum'].iloc[[x-1]] + df_sum['delta'].iloc[[x]]
    return df_sum
I would be very thankful for any help.
Kind regards
Try cumsum():
df['sum'] = df['delta'].cumsum()
Use cumsum. A simple example:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5]})
df['y'] = df['x'].cumsum()
print(df)
Output:
x y
0 1 1
1 2 3
2 3 6
3 4 10
4 5 15
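If you need this as a reusable method across several files, a minimal sketch is below; the column names come from the question, but how each file is loaded (the pd.read_csv call and the file names) is an assumption:
import pandas as pd

def add_running_sum(df):
    # append a 'sum' column holding the running total of 'delta'
    out = df.copy()
    out['sum'] = out['delta'].cumsum()
    return out

# hypothetical usage over several files
for path in ['file1.csv', 'file2.csv']:
    data = pd.read_csv(path, parse_dates=['created_at'])
    data = add_running_sum(data)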
I have a dataset of news articles and their associated concepts and sentiment (NLP detected) which I want to group by 2 fields: the Concept and the Source. A simplification is following:
>>> df = pandas.DataFrame({'concept_label': [1,1,2,2,3,1,1,1],
...                        'source_uri': ['A','B','A','A','A','C','C','C'],
...                        'sentiment_article': [0.05,0.15,-0.3,-0.2,-0.5,-0.6,-0.3,-0.4]})
concept_label source_uri sentiment_article
1 A 0.05
1 B 0.15
2 A -0.3
2 A -0.2
3 A -0.5
1 C -0.6
1 C -0.3
1 C -0.4
So for a concept such as "Coronavirus" I basically want to know how often each news outlet writes about the topic and what the mean sentiment of those articles is. The above df would then look like this:
mean count
concept_label source_uri
3 A -0.50 1
2 A -0.25 2
1 A 0.050 1
1 B 0.150 1
1 C -0.43 3
I am able to do the grouping with the following code (df is the pandas dataframe I'm using, concept_label is the concept, and source_uri is the news outlet):
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count'])
This works just fine and gives me the values I need, however I want the groups with the highest aggregate number of "count" to be at the top. The way I tried to do that is by changing it to the following:
df_grouped = df.groupby(['concept_label','source_uri'])
df_grouped['sentiment_article'].agg(['mean', 'count']).sort_values(by=['count'], ascending=False)
However even though this sorts by the count, it breaks up the groups again. My result currently looks like this:
mean count
concept_label source_uri
3 A -0.50 1
1 A 0.050 1
1 B 0.150 1
2 A -0.25 2
1 C -0.43 3
I don't believe this is the nicest answer, but I found a way to do it.
I grouped the full list first and saved the total count per concept_label as a separate column that I then merged into the existing dataframe. This way I can sort primarily on that column and secondarily on the actual count.
# add the total count per concept_label to the existing table
df_grouped = df.groupby(['concept_label'])['concept_label'].agg(['count']).sort_values(by=['count'])
df_grouped.rename(columns={'count': 'concept_count'}, inplace=True)
df_count = pd.merge(df, df_grouped, left_on='concept_label', right_on='concept_label')
# group, aggregate and sort
df_sentiment = (df_count.groupby(['concept_label', 'source_uri', 'concept_count'])['sentiment_article']
                        .agg(['mean', 'count'])
                        .sort_values(by=['concept_count', 'count'], ascending=False))
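An alternative sketch that avoids the extra merge: aggregate once, compute the per-concept total with a level-wise transform on the result, sort on it, and drop the helper column (column names as in the question):
agg = df.groupby(['concept_label', 'source_uri'])['sentiment_article'].agg(['mean', 'count'])

# total article count per concept_label, broadcast to every (concept, source) row
concept_total = agg.groupby(level='concept_label')['count'].transform('sum')

agg_sorted = (agg.assign(concept_total=concept_total)
                 .sort_values(['concept_total', 'count'], ascending=False)
                 .drop(columns='concept_total'))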
For a specific value, -999.00, I am trying to check whether it exists in any column of my dataframe. If -999.00 exists in a column, I want to create a new column in which only the -999.00 values are replaced with 1.00. For example, below is my dataframe and the output I am trying to get.
Dataframe:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
3 143.90 81.41 80.94 -999.00 1
4 175.86 16.91 134.80 -999.00 0
Output I am trying to get:
MMC MMC_mv MET_lep MASS_Vis Pt_H Pt_H_mv Y
0 138.70 138.70 51.65 97.82 0.91 0.91 0
1 160.93 160.93 68.78 103.23 -999.00 1.00 0
2 -999.00 1.00 162.17 125.95 -999.00 1.00 0
3 143.90 143.90 81.41 80.94 -999.00 1.00 1
4 175.86 175.86 16.91 134.80 -999.00 1.00 0
Below is my code, but it does nothing and gives no error:
for column in df.columns.tolist():
    if (-999.00 in df[column]) == True:
        df[column + '_mv'] = df.column.apply(lambda x: 1.00 if x == -999.00 else x)
print(df.head(3))
Thanks. I appreciate all the help. Please let me know if any additional information is needed.
You can do something like this:
# get column names which contain -999
cols = (df == -999).any()[lambda x: x].index
# create new "_mv" columns for them and replace -999 with 1
df[cols + "_mv"] = df[cols].where(df[cols] != -999, 1)
df
Or if you'd like to write a for loop and update:
for col in df.columns:
    if (df[col] == -999).any():
        df[col + "_mv"] = df[col].replace(-999, 1)
BTW, your solution doesn't work for two reasons:
1) -999 in df[column] doesn't check whether the values contain -999, as you expected, but the index; for membership tests a Series behaves more like a dictionary;
2) since column is a string in the for loop, you can't access the column with df.column, which interprets column as an attribute name; you need df[column] instead.
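A quick illustration of point 1) with a tiny throwaway Series:
import pandas as pd

s = pd.Series([-999.0, 1.0])   # index is 0, 1
-999.0 in s                    # False: membership is tested against the index
(s == -999.0).any()            # True: this checks the values
s.isin([-999.0]).any()         # True: equivalent and often clearer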
I have a df which looks like this:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
I am grouping the df by col1 and col2, and for each member of each group I want to sum the target values of only those other group members whose now date is earlier than (i.e. before) the current member's previous date.
For example for:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
I want to sum the target values of:
col1 col2 now previous target
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
to eventually have:
col1 col2 now previous target sum
A 1 1-1-2015 4-1-2014 0.2 1.8
Interesting problem; I've got something that I think may work.
It is slow, though: worst case O(n**3), best case O(n**2).
Setup data
import pandas as pd
import numpy as np
import io
datastring = io.StringIO(
"""
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
C 1 31-12-2014 4-9-2014 1.9
""")
# arguments for pandas.read_csv
kwargs = {
    "sep": r"\s+",          # specifies that it's a space separated file
    "parse_dates": [2, 3],  # parse "now" and "previous" as dates
}
# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)
Pseudo code for algorithm
For each row:
    For each *other* row:
        If the "now" of the *other* row comes before the "previous" of this row,
        then add the *other* row's "target" to this row's "sum".
Run the algorithm
First, set up a function f() that will be applied to each of the groups computed by df.groupby(["col1","col2"]). All that f() does is implement the pseudo code above.
def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy array
    data = df[["now", "previous", "target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip iteration if outer loop row is equal to the inner loop row
            if i == j: continue
            # get the dates from the rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer loop is greater than
            # the "now" datetime of the inner loop, then add "target" to
            # the cumulative sum
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # add a new column for the "sum" that we calculated
    df["sum"] = _sum
    return df
Now just apply f() over the grouped data.
done = df.groupby(["col1","col2"]).apply(f)
Output
col1 col2 now previous target sum
0 A 1 2015-01-01 2014-04-01 0.20 1.7
1 B 0 2015-02-01 2014-02-05 0.33 0.0
2 A 0 2013-03-01 2011-03-09 0.10 0.0
3 A 1 2014-01-01 2011-04-09 1.70 0.0
4 A 1 2014-12-31 2014-04-09 1.90 1.7
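If the quadratic inner loop ever becomes a bottleneck, a faster sketch is possible (assuming, as in the sample data, that no row's now is earlier than its own previous, so a row never counts itself): sort each group by now, take a cumulative sum of target, and look up each row's previous with searchsorted.
def f_fast(g):
    g = g.copy()
    order = np.argsort(g["now"].values)
    now_sorted = g["now"].values[order]
    csum = np.cumsum(g["target"].values[order])
    # for each row, count how many rows in the group have "now" strictly before its "previous"
    k = np.searchsorted(now_sorted, g["previous"].values, side="left")
    # sum of the first k targets in "now" order; 0.0 when no row qualifies
    g["sum"] = np.where(k > 0, csum[np.maximum(k - 1, 0)], 0.0)
    return g

done = df.groupby(["col1", "col2"], group_keys=False).apply(f_fast)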