How to save duplicates only? - python

I wrote code to remove duplicates from a column in my xlsx file.
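import pandas as pd
from openpyxl.workbook import Workbook

def delete_duplicates(nazov_suboru, cielovy_subor, riadok):
    # nazov_suboru: source file, cielovy_subor: target file, riadok: column name
    data = pd.read_excel(nazov_suboru)
    print("please wait a moment")
    # keep=False drops every row whose value in `riadok` occurs more than once
    data.drop_duplicates(subset=[riadok], keep=False, inplace=True)
    data.to_excel(cielovy_subor, index=False)
    print("done")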
It saves the unique data, but I need the opposite: to save only the duplicated rows. I can't figure it out. Any ideas, please?

data = data[data.duplicated(subset=[riadok], keep=False)]
would keep the duplicated rows.
See pandas.DataFrame.duplicated
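A minimal sketch of how the boolean mask behaves, using made-up sample data (the column name "name" stands in for whatever is passed as riadok):

```python
import pandas as pd

# Hypothetical sample data; "name" plays the role of the `riadok` column.
data = pd.DataFrame({
    "name": ["a", "b", "a", "c", "b", "d"],
    "value": [1, 2, 3, 4, 5, 6],
})

# keep=False marks every member of a duplicate group as True,
# not just the second and later occurrences.
mask = data.duplicated(subset=["name"], keep=False)

duplicates_only = data[mask]   # rows whose "name" occurs more than once
unique_only = data[~mask]      # what drop_duplicates(keep=False) would leave

print(duplicates_only)
```

Inverting the same mask reproduces the original drop_duplicates(..., keep=False) result, so one mask can feed both output files if both are needed.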

Related

How to read excel data only after a string is found but without using skiprows

I want to read the data that comes after the string "Executed Trades". I want to do that dynamically, without using "skiprows". I know openpyxl can be an option, but I am still struggling to make it work. Could you please help me with this? I have many files like the one shown in the image.
Try:
import pandas as pd
#change the Excel filename and the two mentions of 'col1' for whatever the column is
df = pd.read_excel('dictatorem.xlsx')
df = df.iloc[df.col1[df.col1 == 'Executed Trades'].index.tolist()[0]+1:]
df.columns = df.iloc[0]
df = df[1:]
df = df.reset_index(drop=True)
print(df)
Example input/output:
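The same idea can be wrapped in a small helper. This is only a sketch under assumptions: the sheet is represented by an in-memory DataFrame, the marker sits in the first column, and the row right after the marker holds the column names. The helper name table_after_marker and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw sheet, shaped like pd.read_excel(path, header=None) might return it.
raw = pd.DataFrame({
    0: ["Report", "generated 2020", "Executed Trades", "Symbol", "AAA", "BBB"],
    1: [None, None, None, "Qty", 10, 20],
})

def table_after_marker(frame, marker, column=0):
    """Return the table that starts on the row after `marker`.

    Assumes the row right after the marker contains the column names.
    """
    matches = frame.index[frame[column] == marker]
    if len(matches) == 0:
        raise ValueError(f"marker {marker!r} not found in column {column!r}")
    # Positional row number of the header row (one past the marker row).
    start = frame.index.get_loc(matches[0]) + 1
    body = frame.iloc[start:].copy()
    body.columns = body.iloc[0].tolist()          # promote first row to header
    return body.iloc[1:].reset_index(drop=True)   # drop the header row itself

trades = table_after_marker(raw, "Executed Trades")
print(trades)
```

Raising an error when the marker is missing avoids the IndexError that a bare [0] lookup would produce on files that lack the string.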

Read, select and rearrange columns in Pandas

I have a best-practice question. Today I learned how to read and write files in pandas, how to create a table, how to add a column or row, and how to drop them.
I have an Excel file with the following content:
I create a new column "Price_average" by averaging "Price_min" and "Price_max", and write the result to output_1.xlsx:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xlrd
df = pd.read_excel('original.xlsx')
print(df)
df['Price_average'] = (df.Price_min + df.Price_max) / 2
df.to_excel('output_1.xlsx', sheet_name='sheet1', index=False)
print(df)
I then drop the columns "Price_min" and "Price_max" with:
df = df.drop(['Price_min', 'Price_max'], axis=1)
And let's say I now want to create this table:
I can either delete "Age" and "Price_average" and swap "email" with "brand", or can I simply select the columns I want and create a new spreadsheet from them?
What is the best and cleanest way to do it: remove the unwanted columns from the file, then rearrange and, if needed, rename what is left, or pick the needed columns and write a new file with them in the right order? Any suggestions?
You can try this,
selected = df[['Age', 'Price_average', 'Email', 'Brand']]
If you want to change column names,
renamed = selected.rename(columns={'Brand': 'brand', 'Email':'email'})
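Selecting, reordering, and renaming can be chained in one step. The sketch below uses made-up data, and the choice of target columns is an assumption, since the desired table from the question is not shown:

```python
import pandas as pd

# Hypothetical stand-in for the contents of original.xlsx.
df = pd.DataFrame({
    "Age": [3, 5],
    "Brand": ["X", "Y"],
    "Email": ["a@example.com", "b@example.com"],
    "Price_min": [10.0, 20.0],
    "Price_max": [30.0, 40.0],
})
df["Price_average"] = (df["Price_min"] + df["Price_max"]) / 2

# The list order defines the column order of the new frame.
wanted = ["Email", "Brand", "Price_average"]
result = df[wanted].rename(columns={"Email": "email", "Brand": "brand"})

# result.to_excel("output_2.xlsx", index=False)  # hypothetical output file name
print(result)
```

Picking columns by list leaves df untouched, so several differently shaped spreadsheets can be produced from the same source frame without dropping anything first.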

Merging two excel files using python with mismatching sizes

I have been trying to merge these two Excel files.
The files are ready to be joined, as you can see in my image example.
I have tried the solutions from the answer here using pandas and xlwt, but I still cannot save both in one file.
Desired result is:
P.S.: the two data frames may have mismatched columns and rows, which should just be ignored. I am looking for a way to paste one into the other using pandas.
How can I approach this problem? Thank you in advance.
import pandas as pd
import numpy as np

df = pd.read_excel('main.xlsx')
df.index = np.arange(1, len(df) + 1)
df1 = pd.read_excel('alt.xlsx', header=None, names=list(df))
for i in list(df):
    # replace a column only if it has missing values in the main file
    if any(pd.isnull(df[i])):
        df[i] = df1[i]
print(df)
df.to_excel("<filename>.xlsx", index=False)
Try this. The main.xlsx is your first excel file while the alt.xlsx is the second one.
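As an alternative to the column loop (a different technique, not the one in the answer above), DataFrame.fillna accepts another DataFrame and fills only the missing cells, aligning on row index and column names. The sketch assumes both files share index labels and column names where they overlap; the sample frames are made up.

```python
import pandas as pd

# Hypothetical stand-ins for main.xlsx and alt.xlsx.
main = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, "y", None],
})
alt = pd.DataFrame({
    "a": [10, 20, 30, 40],       # one extra row compared to `main`
    "b": ["p", "q", "r", "s"],
    "c": [0, 0, 0, 0],           # extra column that `main` does not have
})

# Only NaN cells in `main` are filled; rows and columns that exist
# only in `alt` are ignored because the result keeps `main`'s shape.
merged = main.fillna(alt)
print(merged)
```

Unlike replacing a whole column, this keeps the values that main already has and touches only the gaps.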

Unique Values Excel Column, no missing info in rows - Python

Currently self-teaching Python and running into some issues. My challenge requires me to count the number of unique values in a column of an excel spreadsheet in which the rows have no missing values. Here is what I've got so far but I can't seem to get it to work:
import xlrd
import pandas as pd
workbook = xlrd.open_workbook("*name of excel spreadsheet*")
worksheet = workbook.sheet_by_name("*name of specific sheet*")
pd.value_counts(df.*name of specific column*)
s = pd.value_counts(df.*name of specific column*)
s1 = pd.Series({'nunique': len(s), 'unique values': s.index.tolist()})
s.append(s1)
print(s)
Thanks in advance for any help.
Use the built-in unique() method to find the unique values in a column.
Here is an example:
import pandas as pd
df=pd.DataFrame(columns=["a","b"])
df["a"]=[1,3,3,3,4]
df["b"]=[1,2,2,3,4]
print(df["a"].unique())
will give the following result:
[1 3 4]
You can store the result (a NumPy array) in a variable if you like, with:
l_of_unique_vals=df["a"].unique()
and then take its length with len() or process it further.
# read the file into a pandas DataFrame named df
df = pd.read_excel("nameoffile.xlsx", sheet_name=name_of_sheet_you_are_loading)
df["column you want to find vals from"].unique()
First you can use pandas read_excel and then unique(), as @Inder suggested.
import pandas as pd
df = pd.read_excel('name_of_your_file.xlsx')
print(df['columns'].unique())
See more here.
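To count unique values only among rows that have no missing values, as the question asks, one possible sketch is to drop incomplete rows first and then call nunique(). The frame below is made-up sample data standing in for the spreadsheet:

```python
import pandas as pd

# Hypothetical data; in practice this would come from pd.read_excel(...).
df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Lima", None],
    "sales": [10, None, 30, 40, 50],
})

# Keep only rows where every column has a value.
complete_rows = df.dropna()

unique_values = complete_rows["city"].unique()   # array of distinct values
n_unique = complete_rows["city"].nunique()       # how many distinct values

print(unique_values, n_unique)
```

Here "Rome" is excluded because its row lacks a sales value, which is the difference from calling unique() on the raw column.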

To Re arrange the columns of dataframe from csv and add format to empty cells

I need to read a CSV file in Python, rearrange its columns, and build a new DataFrame from the rearranged columns.
I tried using a list, but it might be slow.
Is there an alternative using NumPy or pandas?
Edit:
I am rearranging the row using df.reindex()
I am currently exporting the df with the first 4 rows left blank:
df_reindexed.to_excel(writer, sheet_name='Sheet1',startrow=4, index=False)
I need to add formatting and text to cells in those top 4 rows, corresponding to the column names in the rows below.
I know I can use iloc, but is there any way to select the cell directly above a cell with a specified name?
import pandas as pd
# read a CSV with pandas
src = "your/path"
old_df = pd.read_csv(src, sep=",")
# the columns that you want
desired_cols = ['c1','c2']
# pandas will return a new df only with the columns that you want
new_df = old_df[desired_cols]
Another way to do it is:
desired_cols = ['c1', 'c2', 'c3']
df_final = df_final.reindex(columns = desired_cols)
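The two approaches differ when a requested column does not exist. The sketch below uses a made-up frame to show that difference, and also how Index.get_loc gives the 0-based position of a named column, which could then be used to address the cell above that column in the blank top rows (how that cell is written depends on the Excel writer in use).

```python
import pandas as pd

# Hypothetical frame standing in for the CSV contents.
old_df = pd.DataFrame({"c2": [1, 2], "c1": [3, 4], "extra": [5, 6]})

# Selection by list: reorders columns, raises KeyError if one is missing.
selected = old_df[["c1", "c2"]]

# reindex: reorders columns and creates missing ones filled with NaN.
reindexed = old_df.reindex(columns=["c1", "c2", "c3"])

# 0-based position of a column by name in the rearranged frame.
col_pos = reindexed.columns.get_loc("c2")

print(reindexed)
print(col_pos)
```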
