I have a dataframe like the below:
dummy_dict_existing = {'Email':['joblogs#gmail.com', 'joblogs#gmail.com'],
'Ticket_Category': ['Tier1', 'Tier2'],
'Quantity_Purchased': [5,2],
'Total_Price_Paid':[1345.45, 10295.88]}
Email Ticket_Category Quantity_Purchased Total_Price_Paid
0 joblogs#gmail.com Tier1 5 1345.45
1 joblogs#gmail.com Tier2 2 10295.88
What I'm trying to do is to create 2 new columns "Tier1_Quantity_Purchased" and "Tier2_Quantity_Purchased" based on the existing dataframe, and sum the total of "Total_Price_Paid" as below:
dummy_dict_desired = {'Email':['joblogs#gmail.com'],
'Tier1_Quantity_Purchased': [5],
'Tier2_Quantity_Purchased':[2],
'Total_Price_Paid':[11641.33]}
Email Tier1_Quantity_Purchased Tier2_Quantity_Purchased Total_Price_Paid
0 joblogs#gmail.com 5 2 11641.33
Any help would be greatly appreciated. I know there is an easy way to do this, just can't figure out how without writing some silly for loop!
What you want to do is to pivot your table, and then add a column with aggregated data from the original table.
df = pd.DataFrame(dummy_dict_existing)
pivot_df = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
Email
Tier1
Tier2
total
joblogs#gmail.com
5
2
11641.33
For more details on pivoting, take a look at How can I pivot a dataframe?
import pandas as pd
dummy_dict_existing = {'Email':['joblogs#gmail.com', 'joblogs#gmail.com'],
'Ticket_Category': ['Tier1', 'Tier2'],
'Quantity_Purchased': [5,2],
'Total_Price_Paid':[1345.45, 10295.88]}
df = pd.DataFrame(dummy_dict_existing)
df2 = df[['Ticket_Category', 'Quantity_Purchased']]
df_transposed = df2.T
df_transposed.columns = ['Tier1_purchased', 'Tier2_purchased']
df_transposed = df_transposed.iloc[1:]
df_transposed = df_transposed.reset_index()
df_transposed = df_transposed[['Tier1_purchased', 'Tier2_purchased']]
df = df.groupby('Email')[['Total_Price_Paid']].sum()
df = df.reset_index()
df.join(df_transposed)
output
Related
I have a drug database saved in a SINGLE column in CSV file that I can read with Pandas. The file containts 750000 rows and its elements are devided by "///". The column also ends with "///". Seems every row is ended with ";".
I would like to split it to multiple columns in order to create structured database. Capitalized words (drug information) like "ENTRY", "NAME" etc. will be headers of these new columns.
So it has some structure, although the elements can be described by different number and sort of information. Meaning some elements will just have NaN in some cells. I have never worked with such SQL-like format, it is difficult to reproduce it as Pandas code, too. Please, see the PrtScs for more information.
An example of desired output would look like this:
df = pd.DataFrame({
"ENTRY":["001", "002", "003"],
"NAME":["water", "ibuprofen", "paralen"],
"FORMULA":["H2O","C5H16O85", "C14H24O8"],
"COMPONENT":[NaN, NaN, "paracetamol"]})
I am guessing there will be .split() involved based on CAPITALIZED words? The Python 3 code solution would be appreciated. It can help a lot of people. Thanks!
Whatever he could, he helped:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.fillna(method="ffill", inplace=True)
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'].fillna(method="ffill", inplace=True)
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key ENTRY NAME FORMULA \
0 D00001 Water H2O
1 D00002 Nadide C21H28N7O14P2
2 D00003 Oxygen O2
3 D00004 Carbon dioxide CO2
4 D00005 Flavin adenine dinucleotide C27H33N9O15P2
... ... ... ...
11983 D12452 Fostroxacitabine bralpamide hydrochloride C22H30BrN4O8P. HCl
11984 D12453 Guretolimod C24H34F3N5O4
11985 D12454 Icenticaftor C12H13F6N3O3
11986 D12455 Lirafugratinib C28H24FN7O2
11987 D12456 Lirafugratinib hydrochloride C28H24FN7O2. HCl
Key COMPONENT
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
11983 NaN
11984 NaN
11985 NaN
11986 NaN
11987 NaN
[11988 rows x 4 columns]
Need a little more to bring to mind, I leave it to your work.
I need to update values in column DF.EMAIL if have duplicates values in DF.EMAIL column to 0
generate DF
data = [('2345', 'leo#gmai.com'),
('2398', 'leo#hotmai.com'),
('2398', 'leo#hotmai.com'),
('2328', 'leo#yahoo.con'),
('3983', 'leo#yahoo.com.ar')]
serialize DF
df = sc.parallelize(data).toDF(['ID', 'EMAIL'])
# show DF
df.show()
Partial Solution
# create column with value 0 if don't have duplicates
# if have duplicates set value 1
df_join = df.join(
df.groupBy(df.columns).agg((count("*")>1).cast("int").alias("duplicate_indicator")),
on=df.columns,
how="inner"
)
# Update to 1 if have duplicates
df1 = df_join.withColumn(
"EMAIL",
when(df_join.duplicate_indicator == 1,"") \
.otherwise(df_join.EMAIL)
)
Syntax-wise, this looks more compact but yours might perform better.
df = (df.withColumn('count', count('*').over(Window.partitionBy('ID')))
.withColumn('EMAIL', when(col('count') > 1, '').otherwise(col('EMAIL'))))
I am hoping someone can help me understand why a dataframe works one way but not the other. I am relatively new to Python but do have a decent understanding of Pandas. However I can't figure out why I can't add data to the empty dataframe directly first without getting data from another dataframe. I know there are plenty of ways around this. I am just curious as to why.
Thank you for any light you may be able to shed on this.
import pandas as pd
source_df = pd.DataFrame({'iset':['1001']})
print(source_df)
column_names = ['Import Set No','col2']
df = pd.DataFrame(columns = column_names)
df.loc[:,'col2'] = 2
df.loc[:,'Import Set No'] = source_df.loc[:,'iset']
print(df)
Produces:
iset
0 1001
Empty DataFrame
Columns: [Import Set No, col2]
Index: []
And:
import pandas as pd
source_df = pd.DataFrame({'iset':['1001']})
print(source_df)
column_names = ['Import Set No','col2']
df = pd.DataFrame(columns = column_names)
df.loc[:,'Import Set No'] = source_df.loc[:,'iset']
df.loc[:,'col2'] = 2
print(df)
Produces:
iset
0 1001
Index(['Import Set No', 'col2'], dtype='object')
Import Set No col2
0 1001 2
I would like to use a function that produces multiple outputs to create multiple new columns in an existing pandas dataframe.
For example, say I have this test function which outputs 2 things:
def testfunc (TranspoId, LogId):
thing1 = TranspoId + LogId
thing2 = LogId - TranspoId
return thing1, thing2
I can give those returned outputs to 2 different variables like so:
Thing1,Thing2 = testfunc(4,28)
print(Thing1)
print(Thing2)
I tried to do this with a dataframe in the following way:
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23]}
df = pd.DataFrame(data, columns = ['Name','TranspoId','LogId'])
print(df)
df['thing1','thing2'] = df.apply(lambda row: testfunc(row.TranspoId, row.LogId), axis=1)
print(df)
What I want is something that looks like this:
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23], 'Thing1':[13,16,26], 'Thing2':[11,12,20]}
df = pd.DataFrame(data, columns=['Name','TranspoId','LogId','Thing1','Thing2'])
print(df)
In the real world that function is doing a lot of heavy lifting, and I can't afford to run it twice, once for each new variable being added to the df.
I've been hitting myself in the head with this for a few hours. Any insights would be greatly appreciated.
I believe the best way is to change the order and make a function that works with Series.
import pandas as pd
# Create function that deals with series
def testfunc (Series1, Series2):
Thing1 = Series1 + Series2
Thing2 = Series1 - Series2
return Thing1, Thing2
# Create df
data = {'Name':['Picard','Data','Guinan'],'TranspoId':[1,2,3],'LogId':[12,14,23]}
df = pd.DataFrame(data, columns = ['Name','TranspoId','LogId'])
# Apply function
Thing1,Thing2 = testfunc(df['TranspoId'],df['LogId'])
print(Thing1)
print(Thing2)
# Assign new columns
df = df.assign(Thing1 = Thing1)
df = df.assign(Thing2 = Thing2)
# print df
print(df)
Your function should return a series that calculates the new columns in one pass. Then you can use pandas.apply() to add the new fields.
import pandas as pd
df = pd.DataFrame( {'TranspoId':[1,2,3], 'LogId':[4,5,6]})
def testfunc(row):
new_cols = pd.Series([
row['TranspoId'] + row['LogId'],
row['LogId'] - row['TranspoId']])
return new_cols
df[['thing1','thing2']] = df.apply(testfunc, axis = 1)
print(df)
Output:
TranspoId LogId thing1 thing2
0 1 4 5 3
1 2 5 7 3
2 3 6 9 3
I have two data frames, one with historical data and one with some new data appended to the historical data as:
raw_data1 = {'Series_Date':['2017-03-10','2017-03-11','2017-03-12','2017-03-13','2017-03-14','2017-03-15'],'Value':[1,2,3,4,5,6]}
import pandas as pd
df_history = pd.DataFrame(raw_data1, columns = ['Series_Date','Value'])
print df_history
raw_data2 = {'Series_Date':['2017-03-10','2017-03-11','2017-03-12','2017-03-13','2017-03-14','2017-03-15','2017-03-16','2017-03-17'],'Value':[1,2,3,4,4,5,6,7]}
import pandas as pd
df_new = pd.DataFrame(raw_data2, columns = ['Series_Date','Value'])
print df_new
I want to check for all dates in df_history, if data in df_new is different. If data is different then it should append to df_check dataframe as follows:
raw_data3 = {'Series_Date':['2017-03-14','2017-03-15'],'Value_history':[5,6], 'Value_new':[4,5]}
import pandas as pd
df_check = pd.DataFrame(raw_data3, columns = ['Series_Date','Value_history','Value_new'])
print df_check
The key point is that I want to check for all dates that are in my df_history DF and check if a value is present for that day in the df_new DF and if it's same.
Simply run a merge and query filter to capture records where Value_history does not equal Value_new
df_check = pd.merge(df_history, df_new, on='Series_Date', suffixes=['_history', '_new'])\
.query('Value_history != Value_new').reset_index(drop=True)
# Series_Date Value_history Value_new
# 0 2017-03-14 5 4
# 1 2017-03-15 6 5