I am new to Python. I have the following dataframe:
Document_ID OFFSET PredictedFeature word
0 0 2000 abcd
0 8 2000 is
0 16 2200 a
0 23 2200 good
0 25 315 XXYYZZ
1 0 2100 but
1 5 2100 it
1 7 2100 can
1 10 315 XXYYZZ
Now, from this dataframe, what I am trying to do is produce a file in a readable format like:
abcd is 2000, a good 2200
but it can 2100,
PredictedData feature offset endoffset
abcd is 2000 0 8
a good 2200 16 23
NewLine 315 25 25
but it can 2100 0 7
This is the type of data I want: where the same PredictedFeature occurs in sequence, I am concatenating those words together with the feature's value. If the feature is 315, I am giving it a new line.
So, is there any way through which I can do this? Any help will be appreciated.
Thanks
IIUC, you can do groupby():
(df.groupby(['Document_ID', 'PredictedFeature'], as_index=False)
   .agg({'word': ' '.join,
         'OFFSET': ('min', 'max')}))
Output:
Document_ID PredictedFeature word OFFSET
join min max
0 0 315 XXYYZZ 25 25
1 0 2000 abcd is 0 8
2 0 2200 a good 16 23
3 1 315 XXYYZZ 10 10
4 1 2100 but it can 0 7
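If you want column names closer to the layout asked for in the question, a named-aggregation variant might look like this (a sketch; the names PredictedData/offset/endoffset simply follow the desired output above):
out = (df.groupby(['Document_ID', 'PredictedFeature'], as_index=False)
         .agg(PredictedData=('word', ' '.join),   # words of the group joined together
              offset=('OFFSET', 'min'),           # first offset in the group
              endoffset=('OFFSET', 'max')))       # last offset in the group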
Imagine we have the following polars dataframe:
Feature 1   Feature 2   Labels
100         25          1
150         18          0
200         15          0
230         28          0
120         12          1
130         34          1
150         23          1
180         25          0
Now using polars we want to drop every row with Labels == 0 with 50% probability. An example output would be the following:
Feature 1   Feature 2   Labels
100         25          1
200         15          0
230         28          0
120         12          1
130         34          1
150         23          1
I think filter and sample might be handy... I have something but it is not working:
df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5))
How can I make it work?
You can use polars.DataFrame.vstack:
df = (df.filter(pl.col("Labels") == 0).sample(frac=0.5)
.vstack(df.filter(pl.col("Labels") != 0))
.sample(frac=1, shuffle=True))
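Alternatively, a sketch that filters with a random boolean keep-mask instead of splitting and re-stacking; this keeps the original row order (assumes numpy is available):
import numpy as np

# keep a row if Labels != 0, otherwise keep it with 50% probability
keep = (df["Labels"] != 0) | pl.Series(np.random.rand(df.height) < 0.5)
df = df.filter(keep)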
I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
timestep   article   volume
35         1         20
37         2         5
123        2         12
155        3         10
178        2         23
234        1         17
478        1         28
Output Pandas DataFrame:
timestep   volume
35         20
37         25
123        32
178        53
234        50
478        61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill:
# sort if needed
df = df.sort_values("timestep")
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
                df["volume"].where(df["article"].eq(2)).ffill().fillna(0))
output = df.drop("article", axis=1)
>>> output
timestep volume
0 35 20.0
1 37 25.0
2 123 32.0
3 178 43.0
4 234 40.0
5 478 51.0
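This hardcodes articles 1 and 2; if there can be more article ids (the question's sample also contains article 3), the same idea can be generalized with a sketch like the following (my own extension, not part of the original answer):
df = df.sort_values("timestep")
# per row, sum the last seen volume of every article (0 before its first appearance)
df["volume"] = sum(
    df["volume"].where(df["article"].eq(a)).ffill().fillna(0)
    for a in df["article"].unique()
)
output = df.drop("article", axis=1)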
Group by article, take the last element of each group, and sum:
df.groupby(['article']).tail(1)["volume"].sum()
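Note that this returns only a single final total (61 for the sample input: 28 + 23 + 10), not a per-timestep series. A rough per-row variant of the same idea, as a sketch (quadratic, but fine for small frames):
out = (
    df.sort_values("timestep")
      .assign(volume=lambda d: [
          # for each row, sum the last seen volume of every article so far
          d.iloc[:i + 1].groupby("article")["volume"].last().sum()
          for i in range(len(d))
      ])
      .drop(columns="article")
)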
You can assign a group number to each run of consecutive article values with .cumsum(). Then get the last value of the previous group via .map() with GroupBy.last(). Finally, add this previous-group last value to volume, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0 0
1 20
2 20
3 20
4 43
5 43
Name: article, dtype: int64
Try:
df["new_volume"] = (
df.loc[df["article"] != df["article"].shift(-1), "volume"]
.reindex(df.index, method='ffill')
.shift()
+ df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each group by comparing 'article' with the next row, then reindex that series to align with the original dataframe, forward-fill, and shift so each row sees the previous group's last 'volume'. Add this to the current row's 'volume', and fill the first value (which has no previous group) with the original 'volume'.
I need to pivot my data in a df as shown below, based on a specific date in the YYMMDD and HHMM columns, "20180101 100". This specific date marks the start of a new category of data with an equal number of rows. I plan on replacing the repeating column names in the output with unique names. Suppose my data looks like this:
YYMMDD HHMM BestGuess(kWh)
0 20180101 100 20
1 20180101 200 70
0 20201231 2100 50
1 20201231 2200 90
2 20201231 2300 70
3 20210101 000 40
4 20180101 100 5
5 20180101 200 7
6 20201231 2100 2
7 20201231 2200 3
8 20201231 2300 1
9 20210101 000 4
I need the new df (dfpivot) to look like this:
YYMMDD HHMM BestGuess(kWh) BestGuess(kWh)
0 20180101 100 20 5
1 20180101 200 70 7
2 20201231 2100 50 2
3 20201231 2200 90 3
4 20201231 2300 70 1
5 20210101 000 40 4
Does this suffice?
cols = ['YYMMDD', 'HHMM']
df.set_index([*cols, df.groupby(cols).cumcount()]).unstack()
BestGuess(kWh)
0 1
YYMMDD HHMM
20180101 100 20 5
200 70 7
20201231 2100 50 2
2200 90 3
2300 70 1
20210101 0 40 4
More fully baked
cols = ['YYMMDD', 'HHMM']
temp = df.set_index([*cols, df.groupby(cols).cumcount()]).unstack()
temp.columns = [f'{l0} {l1}' for l0, l1 in temp.columns]
temp.reset_index()
YYMMDD HHMM BestGuess(kWh) 0 BestGuess(kWh) 1
0 20180101 100 20 5
1 20180101 200 70 7
2 20201231 2100 50 2
3 20201231 2200 90 3
4 20201231 2300 70 1
5 20210101 0 40 4
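A roughly equivalent sketch using pivot instead of unstack (recent pandas versions accept a list for index; the prefixed column names are my own choice):
cols = ['YYMMDD', 'HHMM']
out = (df.assign(run=df.groupby(cols).cumcount())   # position of each row within its (YYMMDD, HHMM) group
         .pivot(index=cols, columns='run', values='BestGuess(kWh)')
         .add_prefix('BestGuess(kWh) ')
         .reset_index())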
I've been working on this problem for a little bit and am really close. Essentially, I want to create a time series of event counts by type from an event database. Here's what I've done so far:
Starting with an abbreviated version of my dataframe:
event_date year time_precision event_type \
0 2020-10-24 2020 1 Battles
1 2020-10-24 2020 1 Riots
2 2020-10-24 2020 1 Riots
3 2020-10-24 2020 1 Battles
4 2020-10-24 2020 2 Battles
I want the time series to be by month and year, so first I convert the dates to datetime:
nga_df.event_date = pd.to_datetime(nga_df.event_date)
Then, I want to create a time series of events by type, so I one-hot encode them:
nga_df = pd.get_dummies(nga_df, columns=['event_type'], prefix='', prefix_sep='')
Next, I need to extract the month, so that I can create monthly counts:
nga_df['month'] = nga_df.event_date.apply(lambda x: x.month)
Finally, and I am so close here, I group my data by month and year and take the transpose:
conflict_series = nga_df.groupby(['year','month']).sum()
conflict_series.T
Which results in this lovely new dataframe:
year 1997 ... 2020
month 1 2 3 4 5 6 ... 5 6 7
fatalities 11 30 38 112 17 29 ... 1322 1015 619
Battles 4 4 5 13 2 2 ... 77 99 74
Explosions/Remote violence 2 1 0 0 3 0 ... 38 28 17
Protests 1 0 0 1 0 1 ... 31 83 50
Riots 3 3 4 1 4 1 ... 27 14 18
Strategic developments 1 0 0 0 0 0 ... 7 2 7
Violence against civilians 3 5 7 3 2 1 ... 135 112 88
So, I guess what I need to do is combine my index (columns after transpose) so that they are a single index. How do I do this?
The end goal is to combine this data with economic indicators to see if there is a trend, so I need both datasets to be in the same form, where the columns are monthly counts of different values.
Here's how I did it:
Step 1: flatten index:
# convert the multi-index to a flat set of tuples: (YYYY, MM)
index = conflict_series.index.to_flat_index().to_series()
Step 2: Add an arbitrary but required end-of-month day for conversion to a date:
index = index.apply(lambda x: x + (28,))
Step 3: Convert resulting 3-tuple to date time:
index = index.apply(lambda x: datetime.date(*x))
Step 4: Set the new DataFrame index:
conflict_series.set_index(index, inplace=True)
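As a more compact sketch of the same index construction (assuming the MultiIndex levels are (year, month) as above, and keeping the same arbitrary day 28; this produces pandas Timestamps rather than datetime.date objects):
conflict_series.index = pd.to_datetime(
    [f"{year}-{month:02d}-28" for year, month in conflict_series.index]
)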
Results:
fatalities Battles Explosions/Remote violence Protests Riots \
1997-01-28 11 4 2 1 3
1997-02-28 30 4 1 0 3
1997-03-28 38 5 0 0 4
1997-04-28 112 13 0 1 1
1997-05-28 17 2 3 0 4
Strategic developments Violence against civilians total_events
1997-01-28 1 3 14
1997-02-28 0 5 13
1997-03-28 0 7 16
1997-04-28 0 3 18
1997-05-28 0 2 11
And with that, I can make the plot I was looking for.
I have two data frames. One representing when an order was placed and arrived, while the other one represents the working days of the shop.
Days are taken as days of the year, i.e. 32 = 1st February.
orders = DataFrame({'placed':[100,103,104,105,108,109], 'arrived':[103,104,105,106,111,111]})
Out[25]:
arrived placed
0 103 100
1 104 103
2 105 104
3 106 105
4 111 108
5 111 109
calendar = DataFrame({'day':['100','101','102','103','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120'], 'closed':[0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0]})
Out[21]:
closed day
0 0 100
1 1 101
2 1 102
3 0 103
4 0 104
5 0 105
6 0 106
7 0 107
8 1 108
9 1 109
10 0 110
11 0 111
12 0 112
13 0 113
14 0 114
15 1 115
16 1 116
17 0 117
18 0 118
19 0 119
20 0 120
What I want to do is compute the difference between placed and arrived:
x = orders['arrived'] - orders['placed']
Out[24]:
0 3
1 1
2 1
3 1
4 3
5 2
dtype: int64
and subtract one for each day between placed and arrived (inclusive) on which the shop was closed.
i.e. in the first row the order is placed on day 100 and arrives on day 103, so the days involved are 100, 101, 102, 103. The difference between 103 and 100 is 3. However, since 101 and 102 are days on which the shop is closed, I want to subtract 1 for each, i.e. 3 - 1 - 1 = 1. Finally, I want to append this result to the orders df.
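A minimal sketch of one way this could be computed (an illustration, not a posted answer): cast calendar['day'] to int so it can be compared with the order days, count the closed days in each [placed, arrived] range, and subtract them; the lead_time column name is my own choice.
calendar["day"] = calendar["day"].astype(int)   # days were created as strings above
closed_days = set(calendar.loc[calendar["closed"] == 1, "day"])

def adjusted_diff(row):
    # raw difference minus one per closed day in [placed, arrived]
    closed = sum(d in closed_days for d in range(row["placed"], row["arrived"] + 1))
    return row["arrived"] - row["placed"] - closed

orders["lead_time"] = orders.apply(adjusted_diff, axis=1)   # -> 1, 1, 1, 1, 1, 1 for the sample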