I'm working on a df as follows:
df = pd.DataFrame({'ID': {0: 'S0001', 1: 'S0002', 2: 'S0003'},
                   'StartDate': {0: Timestamp('2018-01-01 00:00:00'),
                                 1: Timestamp('2019-01-01 00:00:00'),
                                 2: Timestamp('2019-04-01 00:00:00')},
                   'EndDate': {0: Timestamp('2019-01-02 00:00:00'),
                               1: Timestamp('2020-01-02 00:00:00'),
                               2: Timestamp('2020-04-01 00:00:00')},
                   'Color': {0: 'Blue', 1: 'Green', 2: 'Red'},
                   'Type': {0: 'Small', 1: 'Mid', 2: 'Mid'}})
Now I want to create a df with 366 rows between Start and End dates and I want to add the Color, Type, ID for every row between Start and End Date.
I'm doing the following, which works well:
OutputDF = pd.concat([pd.DataFrame(data=Row['ID'],
                                   index=pd.date_range(Row['StartDate'], Row['EndDate'],
                                                       freq='1D', closed='left'),
                                   columns=['ID'])
                      for index, Row in df.iterrows()])
and I get a df with 2 columns: the ID and the days in the Start/End date range.
I'm able to add the Color/Type by doing a pd.merge on 'ID', but I think there is a more direct way to add the Color and Type columns when creating the df.
I've tried data = [Row['ID'], Row['Type'], Row['Color']] or data = Row[['ID', 'Color', 'Type']] but neither works.
So how can I create my dataframe with the Color for every item across the whole 366 rows directly, without requiring the merge?
Sample of current output:
It goes on for all the days between Start/End dates for each item.
Desired output:
Thanks
Try the pd.DataFrame constructor with a dictionary for data:
pd.concat([pd.DataFrame({'ID':Row['ID'],
'Color':Row['Color'],
'Type':Row['Type']},
index = pd.date_range(Row['StartDate'],
Row['EndDate'],
freq='1D',
closed = 'left'))
for index, Row in df.iterrows()])
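If you also want the dates as a regular column rather than as the index, you could follow the concat with a reset; just a small sketch, assuming the result above is stored in OutputDF (as in the question) and using 'Date' as an assumed column name:

OutputDF = OutputDF.rename_axis('Date').reset_index()  # 'Date' is an assumed name, not from the question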
This is what my original dataset looks like:
url      boolean  details                                                                      numberOfPages  date
xzy.com  0        {'https://www.eltako.depdf': {'numberOfPages': 440, 'date': '2017-09-20'},
                   'https://new.com': {'numberOfPages': 240, 'date': '2017-09-20'}}
The numberOfPages and date columns are initially empty, while the details col has a dictionary. I want to iterate through all rows (urls) and check their details column. For each key in the details column, I want to make a separate row and then use the numberOfPages and date values to fill the column values. The result should be something like this:
url      boolean  pdfLink                     numberOfPages  date
xzy.com  0        https://www.eltako.depdf    440            2017-09-20
                  https://new.com             240            2017-09-20
I tried this but the second line gives me an error: TypeError: string indices must be integers
def arrange(df):
    df = df.explode('details').reset_index(drop=True)
    out = pd.DataFrame(df['details'].map(lambda x: [x[y] for y in x]).explode().tolist())
The original type of Info col was dict. I also tried changing the type to str but I still got the same error. Then I tried changing the lambda function to this:
lambda x:[y for y in x]
but the output I get is something like this:
url      boolean  details                     0
xzy.com  0        https://www.eltako.depdf    h
NaN      NaN                                  t
                                              t
                                              p
So basically the characters of the link are being exploded into separate rows. How can I fix this?
For reference, here is my dataframe as a dict:
{'Company URL': {0: 'https://www.eltako.de/'},
 'Potential Client': {0: 1},
 'PDF Link': {0: nan},
 'Number of Pages': {0: nan},
 'Creation Date': {0: nan},
 'Info': {0: {'https://www.eltako.de/wp-content/uploads/2020/11/Eltako_Gesamtkatalog_LowRes.pdf': {'numberOfPages': 440,
                                                                                                   'date': '2017-09-20'}},
          1: {'https:new.com': {'numberOfPages': 230,
                                'date': '2017-09-20'}}}}
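A minimal sketch (not the original code, and not tested against the full dataset) of one way to get the desired rows: loop over the Info dictionaries directly instead of using explode. Exploding a dict-valued column yields only its keys as strings, which is why the map then iterates over the characters of the link. Column names follow the dict sample above:

import pandas as pd

def expand_info(df):
    # Build one output row per PDF link found in each row's 'Info' dict;
    # rows without an Info dict are skipped in this sketch.
    rows = []
    for _, row in df.iterrows():
        info = row['Info']
        if isinstance(info, dict):
            for link, meta in info.items():
                rows.append({'Company URL': row['Company URL'],
                             'Potential Client': row['Potential Client'],
                             'PDF Link': link,
                             'Number of Pages': meta.get('numberOfPages'),
                             'Creation Date': meta.get('date')})
    return pd.DataFrame(rows)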
I'm trying to use the Python dedupe library to perform a fuzzy duplicate check on my mock data, but I keep getting an error. Here is my mock data:
{'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'},
'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'},
'Invoice Date': {0: '5/10/2019', 1: '5/10/2019', 2: '4/10/2019'},
'Invoice Ref Num': {0: 'ABCDE56.', 1: 'ABCDE56', 2: 'RTET5SDF'},
'Invoice Amount': {0: '56', 1: '56', 2: '100'}}
The error is:
IndexError: Cannot choose from an empty sequence
Here's the code that I'm using:
import pandas as pd
import pandas_dedupe
df = pd.read_csv("duptest.csv")
df.columns
df = pandas_dedupe.dedupe_dataframe(df,['Vendor','Invoice Ref Num','Invoice Amount'])
Any idea what I'm doing wrong? Thanks.
pandas-dedupe creates a sample of observations that you need to label.
The default sample size is 30% of your dataframe.
In your case, you have too few examples in your dataframe to start active learning.
If you set sample_size=1 as follows:
df = pandas_dedupe.dedupe_dataframe(df,['Vendor','Invoice Ref Num','Invoice Amount'], sample_size=1)
you will be able to dedupe your data :)
I want to compare a data frame of one column with another data frame of multiple columns and return the header of the column having maximum match percentage.
I am not able to find any match functions in pandas. First data frame, first column:
cars
----
swift
maruti
wagonor
hyundai
jeep
First data frame, second column:
bikes
-----
RE
Ninja
Bajaj
pulsar
One-column data frame:
words
---------
swift
RE
maruti
waganor
hyundai
jeep
bajaj
Desired output:
100% match header - cars
Try the isin function of pandas DataFrame. Assuming df is your first dataframe and words is a list:
In[1]: (df.isin(words).sum()/df.shape[0])*100
Out[1]:
cars 100.0
bikes 20.0
dtype: float64
You may need to lowercase strings in your df and in the words list to avoid any casing issue.
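For example, a quick sketch of that normalisation, assuming the columns hold strings and words is a plain Python list:

words_lower = [w.lower() for w in words]
(df.apply(lambda col: col.str.lower()).isin(words_lower).sum() / df.shape[0]) * 100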
You can first get the columns into lists:
dfCarsList = df['cars'].tolist()
dfWordsList = df['words'].tolist()
dfBikesList = df['Bikes'].tolist()
And then iterate over the lists for comparison:
numberCars = sum(any(m in L for m in dfCarsList) for L in dfWordsList)
numberBikes = sum(any(m in L for m in dfBikesList) for L in dfWordsList)
You can then use the higher number for your output.
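For instance, a small sketch of that last step, using the numberCars/numberBikes counts computed above:

best_header = 'cars' if numberCars >= numberBikes else 'bikes'
print(best_header)  # 'cars' wins for the sample data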
Construct a Series using numpy.in1d and ndarray.mean then call the Series.idxmax and max methods:
# Setup
df1 = pd.DataFrame({'cars': {0: 'swift', 1: 'maruti', 2: 'waganor', 3: 'hyundai', 4: 'jeep'}, 'bikes': {0: 'RE', 1: 'Ninja', 2: 'Bajaj', 3: 'pulsar', 4: np.nan}})
df2 = pd.DataFrame({'words': {0: 'swift', 1: 'RE', 2: 'maruti', 3: 'waganor', 4: 'hyundai', 5: 'jeep', 6: 'bajaj'}})
match_rates = pd.Series({col: np.in1d(df1[col], df2['words']).mean() for col in df1})
print('{:.0%} match header - {}'.format(match_rates.max(), match_rates.idxmax()))
[out]
100% match header - cars
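Side note (mine, not part of the original answer): newer NumPy releases recommend np.isin over np.in1d, so an equivalent construction would be:

match_rates = pd.Series({col: np.isin(df1[col], df2['words']).mean() for col in df1})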
Here is a solution with a function that returns a tuple (column_name, match_percentage) for the column with the maximum match percentage. It accepts a pandas dataframe (bikes and cars in your example) and a series (words) as arguments.
def match(df, se):
    max_matches = 0
    max_col = None
    for col in df.columns:
        # Get the number of matches in a column
        n_matches = sum([1 for row in df[col] if row in se.unique()])
        if n_matches > max_matches:
            max_col = col
            max_matches = n_matches
    return max_col, max_matches / df.shape[0]
With your example, you should get the following output.
df = pd.DataFrame()
df['Cars'] = ['swift', 'maruti', 'wagonor', 'hyundai', 'jeep']
df['Bikes'] = ['RE', 'Ninja', 'Bajaj', 'pulsar', '']
se = pd.Series(['swift', 'RE', 'maruti', 'wagonor', 'hyundai', 'jeep', 'bajaj'])
In [1]: match(df, se)
Out[1]: ('Cars', 1.0)
I have a dataframe with order IDs, client IDs, Date_order and some metrics (not too important).
For every row, I want to get the ID of that client's most recent previous order.
I tried this one:
data=pd.DataFrame({'ID': [ 133853.0,155755.0,149331.0,337270.0,
775727.0,200868.0,138453.0,738497.0,666802.0,697070.0,128148.0,1042225.0,
303441.0,940515.0,143548.0],
'CLIENT':[ 235632.0,231562.0,235632.0,231562.0,734243.0,
235632.0,235632.0,734243.0,231562.0,734243.0,235632.0,734243.0,231562.0,
734243.0,235632.0],
'DATE_START': [ ('2017-09-01 00:00:00'),
('2017-10-05 00:00:00'),('2017-09-26 00:00:00'),
('2018-03-23 00:00:00'),('2018-12-21 00:00:00'),
('2017-11-23 00:00:00'),('2017-09-08 00:00:00'),
('2018-12-12 00:00:00'),('2018-11-21 00:00:00'),
('2018-12-01 00:00:00'),('2017-08-22 00:00:00'),
('2019-02-06 00:00:00'),('2018-02-20 00:00:00'),
('2019-01-20 00:00:00'),('2017-09-17 00:00:00')]})
data.groupby('CLIENT').apply(lambda x:max(x['ID']))
This takes all IDs into account and displays only three rows (one per client) with the max ID, but for every row of the DataFrame I need to look only among that client's previous orders. Help please)
import pandas as pd
data=pd.DataFrame({
'ID': [133853.0,155755.0,149331.0,337270.0,
775727.0,200868.0,138453.0,738497.0,
666802.0,697070.0,128148.0,1042225.0,
303441.0,940515.0,143548.0],
'CLIENT':[235632.0,231562.0,235632.0,231562.0,734243.0,
235632.0,235632.0,734243.0,231562.0,734243.0,
235632.0,734243.0,231562.0,734243.0,235632.0],
'DATE_START': [('2017-09-01 00:00:00'), ('2017-10-05 00:00:00'),
('2017-09-26 00:00:00'), ('2018-03-23 00:00:00'),
('2018-12-21 00:00:00'), ('2017-11-23 00:00:00'),
('2017-09-08 00:00:00'), ('2018-12-12 00:00:00'),
('2018-11-21 00:00:00'), ('2018-12-01 00:00:00'),
('2017-08-22 00:00:00'), ('2019-02-06 00:00:00'),
('2018-02-20 00:00:00'), ('2019-01-20 00:00:00'),
('2017-09-17 00:00:00')]
})
data.groupby('CLIENT').apply(lambda df:
df[df['DATE_START'] == df['DATE_START'].max()].iloc[0][['ID', 'DATE_START']]
)
Output:
CLIENT ID DATE_START
231562.0 666802.0 2018-11-21 00:00:00
235632.0 200868.0 2017-11-23 00:00:00
734243.0 1042225.0 2019-02-06 00:00:00
Let's break this down:
1.) Group by CLIENT. This will form an iterable of dataframes, grouped by CLIENT.
2.) Apply a function with the desired logic to each dataframe in the group (that's what the apply(lambda df: ...) part is for).
3.) for each dataframe, find the most recent DATE_START, and then subset each dataframe to show only ID with the latest DATE_START (that's what the df[df['DATE_START'] == df['DATE_START'].max()] is for).
4.) At this point, I don't know what logic you want to apply if there are multiple orders from a client on the same date. In this case, I used the first match (.iloc[0]).
5.) And then I return the ID and the DATE_START.
6.) pandas will then understand that you want the logic you applied to each dataframe in the iterable to be combined row-wise, which is why the output is such.
Let me know if this is what you're looking for.
data['id_last_order']= data.sort_values('DATE_START').groupby('CLIENT')['ID'].transform(lambda x: x.shift())
or with a custom function:
def select_last_order_id(row):
    df = data[(data['CLIENT'] == row['CLIENT']) & (data['DATE_START'] < row['DATE_START'])]
    try:
        value = df.groupby(by=['ID', 'CLIENT'], as_index=False, sort=False).agg('max')['ID'].values[0]
    except Exception:
        value = None
    return value

data['id_last_order'] = data.apply(select_last_order_id, axis=1)
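One caveat that applies to both approaches (a sketch of mine, not from the original answer): DATE_START in the sample data is a string. ISO-formatted strings like these happen to sort correctly, but converting to real datetimes first is safer:

data['DATE_START'] = pd.to_datetime(data['DATE_START'])
data = data.sort_values(['CLIENT', 'DATE_START'])
data['id_last_order'] = data.groupby('CLIENT')['ID'].shift()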
I am trying to organise some data into groups based on the contents of certain columns. The current code I have depends on loops, and I would like to vectorise it instead to improve performance. I know this is the way forward with pandas, and while I can vectorise some problems, I am really struggling with this one.
What I need to do is group the data by ClientNumber and link the genuine and incomplete rows so that, for each ClientNumber, every genuine row gets a different process ID and each incomplete row gets the same process ID as the nearest genuine row whose StartDate is greater than the incomplete row's StartDate. (Essentially, incomplete rows should be connected to a genuine row if one is present; once a genuine row is found, it closes that grouping and later rows are treated as separate events.) Then I need to set a process start date for each row, equal to the lowest StartDate within that ProcessID group, and mark the last row (the one with the greatest StartDate) with a ProcessCount in a separate column.
Apologies for my lack of descriptive ability here, hopefully the code (written in Python 3.6) I have so far will better explain my desired outcome. The code works but as you can see relies on nested loops which I don't like. I have tried researching around to find out how to vectorise this but I am struggling to get my head around the concept for this problem.
Any help you can provide me with in straightening out the loops in this code would be most appreciated and really help me better understand how to apply this to other tasks going forwards.
Data
df_dict = {'ClientNumber': {0: 1234, 1: 1234, 2: 1234, 3: 123, 4: 123, 5: 123, 6: 12, 7: 12, 8: 1}, 'Genuine_Incomplete': {0: 'Incomplete', 1: 'Genuine', 2: 'Genuine', 3: 'Incomplete', 4: 'Incomplete', 5: 'Genuine', 6: 'Incomplete', 7: 'Incomplete', 8: 'Genuine'}, 'StartDate': {0: Timestamp('2018-01-01 00:00:00'), 1: Timestamp('2018-01-05 00:00:00'), 2: Timestamp('2018-03-01 00:00:00'), 3: Timestamp('2018-01-01 00:00:00'), 4: Timestamp('2018-01-03 00:00:00'), 5: Timestamp('2018-01-10 00:00:00'), 6: Timestamp('2018-01-01 00:00:00'), 7: Timestamp('2018-06-02 00:00:00'), 8: Timestamp('2018-01-01 00:00:00')}}
df = pd.DataFrame(data=df_dict)
df["ID"] = df.index
df["Process_Start_Date"] = np.nan
df["ProcessCode"] = np.nan
df["ProcessCount"] = np.nan
grouped_df = df.groupby('ClientNumber')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    c = 1
    for i in newdf.iterrows():
        i = i[0]
        GI = df.loc[i, "Genuine_Incomplete"]
        proc_code = "{}_{}".format(df.loc[i, "ClientNumber"], c)
        df.loc[i, "ProcessCode"] = proc_code
        if GI == "Genuine":
            c += 1
grouped_df = df.groupby('ProcessCode')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    df.loc[newdf.ID.iat[-1], "ProcessCount"] = 1
    for i in newdf.iterrows():
        i = i[0]
        df.loc[i, "Process_Start_Date"] = df.loc[newdf.ID.iat[0], "StartDate"]
Note - You may have noticed my use of df["ID"], which is just a copy of the index. I know this is not good practice, but I couldn't work out how to set values from other columns using the index. Any suggestions for doing this are also very welcome.
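For what it's worth, here is a rough sketch of how the two loops might be vectorised (my attempt, only checked against the sample data; the helper name proc_num is made up). It counts, per client, how many Genuine rows occur up to each row to build the ProcessCode, then uses groupby/transform and idxmax, so the real index is used directly and the ID helper column is not needed:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=df_dict)  # df_dict as defined above

# Sort so the cumulative count runs in date order within each client.
df = df.sort_values(['ClientNumber', 'StartDate'])

# Process number = number of Genuine rows strictly before this row (within the
# client) + 1, so a Genuine row still belongs to the process it closes.
is_genuine = df['Genuine_Incomplete'].eq('Genuine').astype(int)
proc_num = is_genuine.groupby(df['ClientNumber']).cumsum() - is_genuine + 1
df['ProcessCode'] = df['ClientNumber'].astype(str) + '_' + proc_num.astype(str)

# Earliest StartDate within each process, broadcast back to every row.
df['Process_Start_Date'] = df.groupby('ProcessCode')['StartDate'].transform('min')

# Mark the row with the greatest StartDate in each process.
df['ProcessCount'] = np.nan
df.loc[df.groupby('ProcessCode')['StartDate'].idxmax(), 'ProcessCount'] = 1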