Merging two Excel files of mismatched sizes using Python - python

I have been trying to merge two Excel files.
The files are already prepared to be joined, as you can see in my image example.
I have tried the solutions from the answer here using pandas and xlwt, but I still cannot save both in one file.
Desired result is:
P.S.: the two data frames may have mismatched columns and rows, which should just be ignored. I am looking for a way to paste one into the other using pandas.
How can I approach this problem? Thank you in advance.

import pandas as pd
import numpy as np

df = pd.read_excel('main.xlsx')
df.index = np.arange(1, len(df) + 1)

# Read the second file without headers, reusing the first file's column names
df1 = pd.read_excel('alt.xlsx', header=None, names=list(df))
df1.index = df.index  # give df1 the same row labels so assignment aligns row by row

# Wherever a column in the first file contains missing values,
# take that whole column from the second file instead
for i in list(df):
    if any(pd.isnull(df[i])):
        df[i] = df1[i]

print(df)
df.to_excel("<filename>.xlsx", index=False)
Try this. main.xlsx is your first Excel file, while alt.xlsx is the second one.
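Note that the loop above swaps in the whole column from alt.xlsx whenever main.xlsx has any gap in it. If you only want to fill the missing cells, combine_first does that in one call (a minimal sketch, assuming both files share column names; merged.xlsx is a placeholder output name):
import pandas as pd

df = pd.read_excel('main.xlsx')
df1 = pd.read_excel('alt.xlsx')

# combine_first fills each NaN cell in df with the value from the same
# row/column position in df1; labels missing from df1 simply stay NaN
merged = df.combine_first(df1)
merged.to_excel('merged.xlsx', index=False)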

Related

How to read excel data only after a string is found but without using skiprows

I want to read the data that comes after the string "Executed Trade", and I want to do it dynamically, not with skiprows. I know openpyxl could be an option, but I am still struggling with it. Could you please help me with this? I have many files like the one shown in the image.
Try:
import pandas as pd

# change the Excel filename and the two mentions of 'col1' to whatever your column is called
df = pd.read_excel('dictatorem.xlsx')

# keep everything below the row that contains the marker string
df = df.iloc[df.col1[df.col1 == 'Executed Trades'].index.tolist()[0] + 1:]

# promote the first remaining row to the header
df.columns = df.iloc[0]
df = df[1:]
df = df.reset_index(drop=True)
print(df)
Example input/output: [image]
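If you don't know in advance which column holds the marker, a variant that scans every cell also works (a sketch; 'trades.xlsx' and the marker text are placeholders for your own file):
import pandas as pd

# read with no header so the marker row is ordinary data on a 0-based index
raw = pd.read_excel('trades.xlsx', header=None)

# find the first row containing the marker string in any column
mask = raw.astype(str).eq('Executed Trades').any(axis=1)
start = mask.idxmax() + 1  # assumes the marker is present; the next row becomes the header

df = raw.iloc[start:]
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)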

How to save duplicates only?

I made code to remove duplicates from a column in my xlsx file.
import pandas as pd

def delete_duplicates(nazov_suboru, cielovy_subor, riadok):
    data = pd.read_excel(nazov_suboru)
    print("chvilelenku pockaj")  # Slovak: "wait a moment"
    data.drop_duplicates(subset=[riadok], keep=False, inplace=True)
    data.to_excel(cielovy_subor, index=False)
    print("done")
It saves the unique data, but I need the opposite: to save only the duplicated ones. I can't figure it out. Any ideas, please?
data = data[data.duplicated(subset=[riadok], keep=False)]
would keep only the duplicated rows.
See pandas.DataFrame.duplicated.
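A quick illustration of what keep=False does (a toy frame, not the question's data):
import pandas as pd

data = pd.DataFrame({'col': ['a', 'b', 'a', 'c', 'b']})

# keep=False marks every member of a duplicated group, so the unique 'c' is dropped
print(data[data.duplicated(subset=['col'], keep=False)])
#   col
# 0   a
# 1   b
# 2   a
# 4   b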

reading multiple files with glob duplicates columns

I'm trying to read many txt files into my data frame, and the code below works. However, it duplicates some of my columns, not all of them. I couldn't find a solution. What can I do to prevent this?
import functools
import glob
import pandas as pd

dfs = pd.DataFrame(pd.concat(map(functools.partial(pd.read_csv, sep='\t', low_memory=False),
                                 glob.glob(r'/folder/*.txt')), sort=False))
Let's say my data should look like this:
[image: expected data layout]
But it looks like this:
[image: result with duplicated columns]
I don't want my columns to be duplicated.
Could you give us a bit more information? Especially the output of dfs.columns would be useful. I suspect there could be some extra spaces in your column names, which would make pandas treat them as different columns.
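If stray whitespace really is the cause, stripping the headers after reading each file should make identical columns line up (a sketch under that assumption):
import glob
import pandas as pd

def read_clean(path):
    df = pd.read_csv(path, sep='\t', low_memory=False)
    df.columns = df.columns.str.strip()  # normalize header whitespace
    return df

dfs = pd.concat(map(read_clean, glob.glob(r'/folder/*.txt')), sort=False)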
Also, you could try dask for this:
import dask.dataframe as dd
dfs = dd.read_csv(r'/folder/*.txt', sep='\t').compute()
It is a bit simpler and should give the same result.
It is important to think of the concat process as having two possible outcomes: by choosing the axis, you can add new columns, as in example (I) below, or new rows, as illustrated in example (II). pd.concat lets you do this by setting axis to either 0 (rows) or 1 (columns).
Read more in the excellent documentation: concat
Example I:
import pandas as pd
import glob
pd.concat([pd.read_csv(f) for f in glob.glob(r'/folder/*.txt')], axis=1)
Example II:
pd.concat([pd.read_csv(f) for f in glob.glob(r'/folder/*.txt')], axis=0)
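To make the difference between the two axes concrete, here is a tiny toy example (frames a and b are made up):
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# axis=1: side by side on a shared row index
print(pd.concat([a, b], axis=1))
#    x  y
# 0  1  3
# 1  2  4

# axis=0: stacked rows, union of columns, gaps become NaN
print(pd.concat([a, b], axis=0, sort=False))
#      x    y
# 0  1.0  NaN
# 1  2.0  NaN
# 0  NaN  3.0
# 1  NaN  4.0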

Unique Values Excel Column, no missing info in rows - Python

Currently self-teaching Python and running into some issues. My challenge requires me to count the number of unique values in a column of an Excel spreadsheet, considering only the rows that have no missing values. Here is what I've got so far, but I can't seem to get it to work:
import xlrd
import pandas as pd
workbook = xlrd.open_workbook("*name of excel spreadsheet*")
worksheet = workbook.sheet_by_name("*name of specific sheet*")
pd.value_counts(df.*name of specific column*)
s = pd.value_counts(df.*name of specific column*)
s1 = pd.Series({'nunique': len(s), 'unique values': s.index.tolist()})
s.append(s1)
print(s)
Thanks in advance for any help.
Use the built-in unique to find the unique values in a column.
Sharing an example with you:
import pandas as pd

df = pd.DataFrame(columns=["a", "b"])
df["a"] = [1, 3, 3, 3, 4]
df["b"] = [1, 2, 2, 3, 4]
print(df["a"].unique())
will give the following result:
[1 3 4]
So you can store the result (a NumPy array, not a list) in a variable if you like, with:
l_of_unique_vals = df["a"].unique()
and find its length or do anything else with it.
df = pd.read_excel("nameoffile.xlsx", sheet_name=name_of_sheet_you_are_loading)
#in the line above we are reading the file in a pandas dataframe and giving it a name df
df["column you want to find vals from"].unique()
First you can use pandas read_excel and then unique, as @Inder suggested.
import pandas as pd

df = pd.read_excel('name_of_your_file.xlsx')
print(df['columns'].unique())
See more here.
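Note that none of the snippets above handle the "rows have no missing values" part of the challenge; dropping incomplete rows before taking unique covers it (a sketch; the file, sheet, and column names are placeholders):
import pandas as pd

df = pd.read_excel('spreadsheet.xlsx', sheet_name='Sheet1')

# keep only rows with no missing values in any column
complete = df.dropna()

print(complete['some_column'].nunique())  # how many unique values
print(complete['some_column'].unique())   # the unique values themselves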

How to import all fields from xls as strings into a Pandas dataframe?

I am trying to import an xlsx file into a Python Pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following using a dict comprehension with range(100).
import pandas as pd

filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files have a varying number of columns all the time, and I am looking for a way to handle this without changing the range manually every time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
# xl.book is the underlying xlrd Book, so we can ask it for the sheet's column count
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i: str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel():
import pandas as pd

filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution (sketched below) is:
1. read in one row of data, just to get the column names and the number of columns
2. create the converters dictionary automatically, mapping each column to the string type
3. re-read the full data using the dictionary created at step 2
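A minimal sketch of those three steps, reusing the question's placeholder path (nrows=0 just grabs the header row):
import pandas as pd

filename = r'C:\DemoFile.xlsx'

# step 1: read only the header row to learn the column names
header = pd.read_excel(filename, nrows=0)

# step 2: build the converters dict automatically, one str entry per column
converters = {col: str for col in header.columns}

# step 3: re-read the full sheet with the generated dict
df = pd.read_excel(filename, converters=converters)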
