I have a Pandas DataFrame containing an ID, a Code and a Date. For certain codes, I would like to flag subsequent appearances of the same ID, based on the date, with the set of codes that have gone missing. I would also like to know the date of the first appearance of each such code against the respective ID.
Example as follows (NB: only codes A and B carry over):
import pandas as pd
d = {'ID': [1, 2, 1, 2, 3, 1], 'date': ['2017-03-22', '2017-03-21', '2017-03-23', '2017-03-24', '2017-03-28', '2017-03-28'], 'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C']}
df = pd.DataFrame(data=d)
# only A and B codes carry over
df
The target dataframe would ideally look as follows:
import pandas as pd
d = {'ID': [1, 2, 1, 2, 3, 1], 'date': ['2017-03-22', '2017-03-21', '2017-03-23', '2017-03-24', '2017-03-28', '2017-03-28'], 'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C'], 'Missing_code': ['', '', 'A', 'A', '', 'A, B'], 'First_code_date': ['', '', '2017-03-22', '2017-03-21', '', '2017-03-22, 2017-03-23']}
df = pd.DataFrame(data=d)
df
Note I am not fussy about how 'First_code_date' looks, provided it is dynamic, as the number of codes may increase or decrease.
If the example is not clear please let me know and I will adjust.
Thank you for your help.
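One way to approach this (a sketch of mine, not from the original post): sort by ID and date, record the date on which each carry-over code first appears for an ID, and flag any later row for that ID that no longer contains it. Only the column names come from the question; everything else is an assumption.

```python
import pandas as pd

CARRY = ['A', 'B']  # only these codes carry over

df = pd.DataFrame({'ID': [1, 2, 1, 2, 3, 1],
                   'date': ['2017-03-22', '2017-03-21', '2017-03-23',
                            '2017-03-24', '2017-03-28', '2017-03-28'],
                   'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C']})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['ID', 'date']).reset_index(drop=True)

seen = {}                      # (ID, code) -> date of first appearance
missing, first_dates = [], []
for _, row in df.iterrows():
    codes = {c.strip() for c in row['Code'].split(',')}
    # carry-over codes seen earlier for this ID but absent from this row
    gone = [c for c in CARRY if (row['ID'], c) in seen and c not in codes]
    missing.append(', '.join(gone))
    first_dates.append(', '.join(seen[(row['ID'], c)].strftime('%Y-%m-%d')
                                 for c in gone))
    for c in codes.intersection(CARRY):
        seen.setdefault((row['ID'], c), row['date'])

df['Missing_code'] = missing
df['First_code_date'] = first_dates
```

The ', '.join keeps 'First_code_date' dynamic: it grows or shrinks with however many codes are missing on a given row.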
I have a data set of orders with the item ordered, the quantity ordered, and the box it was shipped in. I'd like to find the possible order combinations of [Box Type, Item, Quantity] and assign each order an identifier for its combination for further analysis. Ideally, the output would look like this:
d2 = {'Order Number': [1, 2, 3], 'Order Type': [1, 2, 1]}
pd.DataFrame(d2)
Where grouping by 'Order Type' would provide a count of the unique order types.
The problem is that each box is assigned a unique code, which is necessary to distinguish whether a box held multiple items. In the example data below, "Box_id" = 3 shows that the second "Box A" contains two items (A2 and A3). While this field is needed to tell multi-item boxes apart, its uniqueness stops otherwise identical orders from matching.
import pandas as pd
d = {'Order Number': [1, 2, 2, 2, 3], 'Box_id': [1, 2, 3, 3, 4], 'Box Type': ['Box A', 'Box B', 'Box A', 'Box A', 'Box A'],
'Item': ['A1', 'A2', 'A2', 'A3', 'A1'], 'Quantity': [2, 4, 2, 2, 2]}
pd.DataFrame(d)
I have tried representing each order as a tuple of its [Box type, Item, Quantity] data and using those tuples to capture counts with a default dictionary, but that output is understandably messy to interpret and difficult to match with orders afterwards.
from collections import defaultdict
combinations = defaultdict(int)
Order1 = ((('Box A', 'A1', 2),),)
Order2 = (('Box B', 'A2', 4), (('Box A', 'A2', 2),('Box A', 'A3', 2)))
Order3 = ((('Box A', 'A1', 2),),)
combinations[Order1] += 1
combinations[Order2] += 1
combinations[Order3] += 1
# Should result in
combinations == {((('Box A', 'A1', 2),),): 2,
                 (('Box B', 'A2', 4), (('Box A', 'A2', 2), ('Box A', 'A3', 2))): 1}
Is there an easier way to get a representation of unique order combinations and their counts?
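One possible sketch (my own approach, not the original attempt): collapse each Box_id into a sorted tuple of its contents, collapse each order into a sorted tuple of its boxes so that Box_id itself drops out, and let pd.factorize hand out the 'Order Type' integers:

```python
import pandas as pd

df = pd.DataFrame({'Order Number': [1, 2, 2, 2, 3],
                   'Box_id': [1, 2, 3, 3, 4],
                   'Box Type': ['Box A', 'Box B', 'Box A', 'Box A', 'Box A'],
                   'Item': ['A1', 'A2', 'A2', 'A3', 'A1'],
                   'Quantity': [2, 4, 2, 2, 2]})

# each physical box becomes one hashable, order-independent tuple of its rows
boxes = (df.groupby(['Order Number', 'Box_id'])[['Box Type', 'Item', 'Quantity']]
           .apply(lambda g: tuple(sorted(map(tuple, g.values)))))

# each order becomes a sorted tuple of its boxes (Box_id itself disappears)
orders = boxes.groupby(level='Order Number').apply(lambda s: tuple(sorted(s)))

# identical combinations receive identical integer codes
codes, uniques = pd.factorize(orders)
order_type = pd.DataFrame({'Order Number': orders.index,
                           'Order Type': codes + 1})
```

Grouping by 'Order Type' (or calling orders.value_counts()) then gives the count of each unique combination directly, without the nested-tuple bookkeeping.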
I want to delete the last character if it is a number.
From the current dataframe
import numpy as np
import pandas as pd

data = {'d': ['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', np.nan, '123456']}
df = pd.DataFrame(data)
to the new dataframe
data = {'d':['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', np.nan, '123456'],
'expected': ['AAA', 'BB', 'C', 'DDD', 'EEEEEEE)', 'FFF (', np.nan, '12345']}
df = pd.DataFrame(data)
df
Using .str.replace:
df['d'] = df['d'].str.replace(r'(\d)$', '', regex=True)
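For reference, this removes exactly one trailing digit and lets NaN pass through untouched. The expected output above also drops the space left behind in 'BB 2', so a second strip step (my addition, not part of the original answer) may be wanted:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'d': ['AAA2', 'BB 2', 'C', 'DDD ',
                         'EEEEEEE)', 'FFF ()', np.nan, '123456']})

# drop a single trailing digit, then any trailing whitespace
df['d'] = (df['d'].str.replace(r'\d$', '', regex=True)
                  .str.replace(r'\s+$', '', regex=True))
```

Note that 'FFF ()' keeps its closing bracket here, since ')' is not a digit; the 'FFF (' row in the question's expected output looks inconsistent with its own rule.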
I have a dataset with more than 6k rows.
I want to count missing data and non-numeric data (errors) simultaneously, and then plot the occurrences as a histogram.
I use the code below to find the missing and error data, but I can only filter one subset at a time and don't know how to sum them up. The dtype of columns a, b, and c is object; Id and d are int and float.
How can this be done programmatically, and then shown with a histogram?
df[pd.to_numeric(df['a'], errors='coerce').isnull()]
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'a': [1, 2, 'good', 'bad', np.nan],
                   'b': [0.1, 'worse', np.nan, 'better', 0.5],
                   'c': ['2.5', 'best', '6.5', 'NaN', '10.5'],
                   'd': ['10', '20', '30', '40', '50']})
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['', np.nan, 3], 'B': ['amount', 5, 3]})
df_error = (pd.to_numeric(df.stack(dropna=False), errors='coerce')
              .isna()
              .map({True: 'error', False: 'not error'})
              .groupby(level=1)
              .value_counts()
              .unstack())
df_error.plot(kind='bar')
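As a cross-check, the same error counts can be computed per column without stacking (a variant of mine, not the original answer). A cell counts as an error when to_numeric coerces it to NaN, which covers both missing values and non-numeric text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['', np.nan, 3], 'B': ['amount', 5, 3]})

# per column: how many cells fail numeric conversion (NaN after coercion)
error_counts = df.apply(lambda col: pd.to_numeric(col, errors='coerce').isna().sum())
```

error_counts.plot(kind='bar') then gives the same per-column histogram of error occurrences.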
I have an interface which receives a str with the condition to be executed.
Is it possible to pass this str directly to my pandas .loc?
example:
df = pd.DataFrame({'col A': [1, 2, 3, 4],
                   'col B': ['a', 'b', 'c', 'd']})
entry_str = "col A == 2"
df = df.loc[entry_str]
Just use df.query(). Note that column names containing spaces must be wrapped in backticks inside the query string: df.query('`col A` == 2')
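A minimal end-to-end sketch; the backticks are what query() requires for column names containing spaces:

```python
import pandas as pd

df = pd.DataFrame({'col A': [1, 2, 3, 4],
                   'col B': ['a', 'b', 'c', 'd']})

# backtick-quote the column name because it contains a space
entry_str = "`col A` == 2"
result = df.query(entry_str)
```

Since query() takes the condition as a plain string, the interface can hand the string straight through.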
Let's suppose I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
'arbitrarydata': [False] * 10})
I want to assign a value to the arbitrarydata column according to the values in both of the other columns. A naive approach would be as follows:
import numpy as np

for _, grp in df.groupby(['label', 'numbers']):
    grp.arbitrarydata = np.random.rand()
Naturally, this doesn't propagate changes back to df. Is there a way to modify a group such that changes are reflected in the original DataFrame ?
Try using transform, e.g.:
df['arbitrarydata'] = df.groupby(['label', 'numbers'])['arbitrarydata'].transform(lambda x: np.random.rand())
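A runnable sketch of the transform approach, confirming that every row in a (label, numbers) group receives the same draw while distinct groups get distinct values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
                   'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'arbitrarydata': [False] * 10})

# transform computes one value per group and broadcasts it back onto the rows,
# so the assignment writes into df itself rather than a group copy
df['arbitrarydata'] = (df.groupby(['label', 'numbers'])['arbitrarydata']
                         .transform(lambda x: np.random.rand()))

n_groups = df.groupby(['label', 'numbers']).ngroups  # 5 distinct groups here
```

This is why transform solves the propagation problem: the result is aligned to the original index, unlike the copies handed out by iterating over groups.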