Consider the DataFrame df:
df = pd.DataFrame({'Name': ['Tesla', 'Tesla', 'Tesla', 'Toyota', 'Ford', 'Ford', 'Ford', 'BMW', 'BMW', 'BMW', 'Mercedes', 'Mercedes', 'Mercedes'],
                   'Type': ['Model X', 'Model X', 'Model X', 'Corolla', 'Bronco', 'Bronco', 'Mustang', '3 Series', '3 Series', '7 Series', 'C-Class', 'C-Class', 'S-Class'],
                   'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020],
                   'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000],
                   'Color': ['White', 'White', 'White', 'Red', 'Blue', 'Blue', 'Yellow', 'Silver', 'Silver', 'Black', 'White', 'White', 'Black']})
I am trying to merge cells in Excel that have the same value in consecutive rows of a DataFrame column, using the mergecells function below. However, when I open the Excel file after merging, Excel says it has recovered some of the values.
def mergecells(df, columntomerge, sheetname, writer):
    df1 = df.index.to_series().groupby(df[columntomerge]).agg(['first', 'last']).reset_index()
    df1 = df1.sort_values("first").reset_index()
    first_last_rows = df1.set_index('first')['last'].to_dict()
    merge_ranges = {}
    for key, value in first_last_rows.items():
        if df.loc[key, columntomerge] in ["Alpha", "-"] or key == value:
            continue
        merge_ranges[df.loc[key, columntomerge]] = (
            key + 1, df.columns.get_loc(columntomerge),
            value + 1, df.columns.get_loc(columntomerge))
    wb = writer.book
    ws = writer.sheets[sheetname]
    mf = wb.add_format({'align': 'center', 'valign': 'vcenter'})
    for name, merge_range in merge_ranges.items():
        ws.merge_range(*merge_range, name, mf)

for col in df.columns:
    mergecells(df, col, 'Trial', writer)
But when I call the merge function with the code above, I get the error shown in the image below.
The Name, Type and Price columns are merged correctly; however, Year and Color are completely wrong.
Expected Output
The problem lies in the groupby: you get disjoint intervals when grouping by Color or Year. For example, White is found in rows [1, 2, 3] but also in rows [11, 12]. You should only merge consecutive equal values in a column; more_itertools.consecutive_groups can help you with that:
from more_itertools import consecutive_groups

sheetname = 'Sheet1'
with pd.ExcelWriter("test.xlsx") as writer:
    df.to_excel(writer, sheet_name=sheetname, index=False)
    wb = writer.book
    ws = writer.sheets[sheetname]
    mf = wb.add_format({'align': 'center', 'valign': 'vcenter'})
    for j, col in enumerate(df.columns):
        ws.set_column(j, j, 12, mf)
        for val in df[col].unique():
            # indices of the rows where the value is the same as in the previous row
            idx = df[(df[col] == val) & (df[col] == df[col].shift(1))].index
            for seg in consecutive_groups(idx):
                l = list(seg)
                # the header row shifts everything down by one, and idx skips the
                # first row of each run, so l[0] is already the correct start row
                ws.merge_range(l[0], j, l[-1] + 1, j, val, mf)
I have a question related to Pandas.
In df1 I have a DataFrame with the id of each salesman and their respective names.
In df2 I have the id of the salesmen and their respective sales.
I would like to add to df2 two new columns with the first and last names of the salesmen.
PS: in df2 one of the sales is shared between two salesmen.
import pandas as pd

vendors = {'first_name': ['Montgomery', 'Dagmar', 'Reeba', 'Shalom', 'Broddy', 'Aurelia'],
           'last_name': ['Humes', 'Elstow', 'Wattisham', 'Alen', 'Keningham', 'Brechin'],
           'id_vendor': [127, 241, 329, 333, 212, 233]}
sales = {'id_vendor': [['127'], ['241'], ['329, 333'], ['212'], ['233']],
         'sales': [1233, 25000, 8555, 4333, 3222]}
df1 = pd.DataFrame(vendors)
df2 = pd.DataFrame(sales)
I attach the code. Any suggestions?
Thank you in advance.
You can merge df1 with df2 after exploding the id_vendor column, then use DataFrameGroupBy.agg when grouping by sales to obtain the columns you want:
transform_names = lambda x: ', '.join(list(x))

res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
first_name last_name id_vendor
sales
1233 Montgomery Humes [127]
3222 Aurelia Brechin [233]
4333 Broddy Keningham [212]
8555 Reeba, Shalom Wattisham, Alen [329, 333]
25000 Dagmar Elstow [241]
Note:
In your example, id_vendor in df2 is populated by lists of strings, but since id_vendor in df1 is of integer type, I assume that was a typo. If id_vendor does in fact contain lists of strings, you also need to convert the strings to integers:
transform_names = lambda x: ', '.join(list(x))

# Notice the .astype(int) call.
res = (df1.merge(df2.explode('id_vendor').astype(int))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
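To see what explode contributes here, a minimal sketch (assuming id_vendor holds lists of integers, as the note about the typo suggests it should):

```python
import pandas as pd

df2 = pd.DataFrame({'id_vendor': [[127], [241], [329, 333], [212], [233]],
                    'sales': [1233, 25000, 8555, 4333, 3222]})

# explode turns each list element into its own row, repeating the
# other columns, so the shared sale of 8555 appears once per vendor
exploded = df2.explode('id_vendor')
print(exploded)
```

After this step each row holds a single vendor id, so a plain merge with df1 on id_vendor attaches the names.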
Using Python and pandas, I would like to achieve the output below: whenever there are Null or NaN values present in the file, print both the row number and the column name.
import pandas as pd

# List of tuples
employees = [('Stuti', 'Null', 'Varanasi', 20000),
             ('Saumya', 'NAN', 'NAN', 35000),
             ('Saumya', 32, 'Delhi', 30000),
             ('Aaditya', 40, 'Dehradun', 24000),
             ('NAN', 45, 'Delhi', 70000)]

# Create a DataFrame object from the list
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])
print(df)
Expected Output:
Row 0: column Age missing
Row 1: column Age, column City missing
Row 4: column Name missing
Try isin to mask the missing values, then matrix-multiply (@) with the column names to concatenate them:
s = df.isin(['Null', 'NAN'])
missing = s.loc[s.any(axis=1)] @ ('column ' + df.columns + ', ')
for r, val in missing.str[:-2].items():
    print(f'Row {r}: {val} is missing')
Output:
Row 0: column Age is missing
Row 1: column Age, column City is missing
Row 4: column Name is missing
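If the file contains real NaN values rather than the literal strings 'Null'/'NAN', the same report can be built from isna() plus stack(); a sketch on a small made-up frame:

```python
import pandas as pd
import numpy as np

df_nan = pd.DataFrame({'Name': ['Stuti', 'Saumya', np.nan],
                       'Age': [np.nan, 32, 45]})

# Stacking the boolean mask and keeping only the True cells leaves a
# MultiIndex of (row, column) pairs, one per missing cell
missing = df_nan.isna().stack().loc[lambda s: s]
for row, cols in missing.groupby(level=0):
    names = ', '.join('column ' + c for c in cols.index.get_level_values(1))
    print(f'Row {row}: {names} missing')
```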
I want to convert this dict into a pandas DataFrame where each key becomes a column and the values in each list become the rows:
my_dict = {'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
           'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
           'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
           'Rank': [1, 3, 7, 4, 25]}
The lists in my_dict can also have missing values, which should appear as NaNs in the DataFrame.
This is how I'm currently trying to append it to my DataFrame:
df = pd.DataFrame(columns=['Last updated', 'Symbol', 'Name', 'Rank'])
df = df.append(my_dict, ignore_index=True)
# print(df)
df.to_excel(r'\walletframe.xlsx', index=False, header=True)
But my output only has a single row containing all the values.
The answer was pretty simple. Instead of using
df = df.append(my_dict)
I used
df = pd.DataFrame.from_dict(my_dict, orient='index').T
With orient='index' the keys become rows and the shorter lists are padded with NaN; the transpose then turns each key into a column, exactly as wanted.
Credits to @Ank who helped me find the solution!
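An equivalent way to get the NaN padding, without the transpose, is to wrap each list in a Series so pandas aligns the different lengths itself; a sketch with a trimmed-down my_dict:

```python
import pandas as pd

my_dict = {'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
           'Rank': [1, 3, 7, 4, 25],
           'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z']}

# Each Series keeps its own length; the constructor aligns them on the
# index and fills the gaps in the shorter columns with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in my_dict.items()})
print(df)
```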
I pulled a list of historical option prices of AAPL using the Robinhood function robin_stocks.get_option_historicals(). The data was returned as a dictionary containing a list of dictionaries, as shown below.
I am having difficulty converting the object below (named historicalData) into a DataFrame. Can someone please help?
historicalData = {'data_points': [{'begins_at': '2020-10-05T13:30:00Z',
                                   'open_price': '1.430000',
                                   'close_price': '1.430000',
                                   'high_price': '1.430000',
                                   'low_price': '1.430000',
                                   'volume': 0,
                                   'session': 'reg',
                                   'interpolated': False},
                                  {'begins_at': '2020-10-05T13:40:00Z',
                                   'open_price': '1.430000',
                                   'close_price': '1.340000',
                                   'high_price': '1.440000',
                                   'low_price': '1.320000',
                                   'volume': 0,
                                   'session': 'reg',
                                   'interpolated': False}],
                  'open_time': '0001-01-01T00:00:00Z',
                  'open_price': '0.000000',
                  'previous_close_time': '0001-01-01T00:00:00Z',
                  'previous_close_price': '0.000000',
                  'interval': '10minute',
                  'span': 'week',
                  'bounds': 'regular',
                  'id': '22b49380-8c50-4c76-8fb1-a4d06058f91e',
                  'instrument': 'https://api.robinhood.com/options/instruments/22b49380-8c50-4c76-8fb1-a4d06058f91e/'}
I tried the code below, but that didn't help:
import pandas as pd
df = pd.DataFrame(historicalData)
df
You didn't write that you want only data_points (as in the
other answer), so I assume that you want your whole dictionary
converted to a DataFrame.
To do it, start with your code:
df = pd.DataFrame(historicalData)
It creates a DataFrame, with data_points "exploded" to
consecutive rows, but they are still dictionaries.
Then rename open_price column to open_price_all:
df.rename(columns={'open_price': 'open_price_all'}, inplace=True)
The reason is to avoid duplicated column names after join
to be performed soon (data_points contain also open_price
attribute and I want the corresponding column from data_points
to "inherit" this name).
The next step is to create a temporary DataFrame - a split of
dictionaries in data_points to individual columns:
wrk = df.data_points.apply(pd.Series)
Print wrk to see the result.
And the last step is to join df with wrk and drop
data_points column (not needed any more, since it was
split into separate columns):
result = df.join(wrk).drop(columns=['data_points'])
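As an alternative to the rename/split/join sequence, pandas.json_normalize can flatten the same structure in one call: record_path expands data_points into rows while meta carries top-level fields along. A sketch with a trimmed-down historicalData (only two of the top-level keys shown):

```python
import pandas as pd

historicalData = {'data_points': [{'begins_at': '2020-10-05T13:30:00Z',
                                   'open_price': '1.430000'},
                                  {'begins_at': '2020-10-05T13:40:00Z',
                                   'open_price': '1.430000'}],
                  'interval': '10minute',
                  'span': 'week'}

# One row per entry of data_points; the meta columns repeat on every row
result = pd.json_normalize(historicalData, record_path='data_points',
                           meta=['interval', 'span'])
print(result)
```

Note that listing the top-level open_price in meta would clash with the open_price inside data_points, which is why the manual approach above renames it first.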
This is pretty easy to solve with the code below. I converted the data points into a list of DataFrames via a list comprehension:
import pandas as pd
df_list = [pd.DataFrame(dic.items(), columns=['Parameters', 'Value']) for dic in historicalData['data_points']]
You then could do:
df_list[0]
which will yield
Parameters Value
0 begins_at 2020-10-05T13:30:00Z
1 open_price 1.430000
2 close_price 1.430000
3 high_price 1.430000
4 low_price 1.430000
5 volume 0
6 session reg
7 interpolated False
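If instead you want one single table with a row per data point (rather than a list of per-point frames), the list of dicts can go straight into the DataFrame constructor; a minimal sketch with trimmed-down records:

```python
import pandas as pd

# Stand-in for historicalData['data_points'], keeping only three keys
data_points = [{'begins_at': '2020-10-05T13:30:00Z', 'open_price': '1.430000', 'volume': 0},
               {'begins_at': '2020-10-05T13:40:00Z', 'open_price': '1.430000', 'volume': 0}]

# Each dict becomes one row; the union of the keys becomes the columns
points_df = pd.DataFrame(data_points)
print(points_df.shape)  # (2, 3)
```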
I am reading a file from SIPRI. It reads into pandas, the DataFrame is created and I can display it, but when I try to sort by a column I get a KeyError. Here is the code and the error:
import os
import pandas as pd

os.chdir('C:\\Users\\Student\\Documents')

# Find the top 20 countries in military spending by sorting
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xls',
                     header=0, index_col=0, sheetname='Current USD')
data.sort_values(by='2016', ascending=False)
KeyError: '2016'
You get the KeyError because the column '2016' is not present in the dataframe: judging by the Excel file, the year headers come in as integers, not strings, and the sheet needs some cleaning before you can sort.
You can skip the top 5 rows and the bottom 8 rows to get just the countries, then replace all the string and missing values with NaN. The following code will do that:
import numpy as np

data = pd.read_excel('./SIPRI-Milex-data-1949-2016.xlsx', header=0, index_col=0,
                     sheet_name='Current USD', skiprows=5, skipfooter=8)
data = data.replace(r'\s+', np.nan, regex=True).replace('xxx', np.nan)
new_df = data.sort_values(2016, ascending=False)
top_20 = new_df[:20].index.tolist()
Output:
['USA', 'China, P.R.', 'Russian Federation', 'Saudi Arabia', 'India', 'France', 'UK', 'Japan', 'Germany', 'Korea, South', 'Italy', 'Australia', 'Brazil', 'Israel', 'Canada', 'Spain', 'Turkey', 'Iran', 'Algeria', 'Pakistan']
Well, this could be helpful, I guess:
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xlsx', skiprows=5,
                     index_col=0, sheet_name='Current USD')
data.dropna(inplace=True)
data.sort_values(by=2016, ascending=False, inplace=True)
And to get the top 20 you can use:
data[data[2016].apply(lambda x: isinstance(x, (int, float)))][:20]
I downloaded the file, and it looks like 2016 is not a column label by itself, so you need to modify the dataframe a bit to make the country row the header.
The next thing is that you need to write data.sort_values(by=2016, ascending=False), i.e. treat the column name as an integer instead of a string.
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xlsx',
                     header=0, index_col=0, sheet_name='Current USD')
data = data[4:]
data.columns = data.iloc[0]
data.sort_values(by=2016, ascending=False)
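Alternatively, once the header row is fixed, the integer column labels can be normalised to strings so that sorting by '2016' works exactly as the question originally tried; a sketch on a made-up stand-in frame (the real SIPRI sheet has one column per year):

```python
import pandas as pd

# Stand-in for the cleaned SIPRI sheet: integer year labels, country index
data = pd.DataFrame({2015: [10, 30, 20], 2016: [15, 40, 25]},
                    index=['A', 'B', 'C'])

data.columns = data.columns.astype(str)  # 2016 -> '2016'
top = data.sort_values(by='2016', ascending=False)
print(top.index.tolist())  # ['B', 'C', 'A']
```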