I have an old Excel spreadsheet with a lot of data in a relational-database-style format, with one main primary-key column that I need to go through.
I want to compare some rows, but there are many entries (thousands of rows, dozens of columns) and Excel doesn't really have built-in features for this. After looking around, I found that the best way to extract the data is with a Python script, but I have no programming skills in Python or any other language, for that matter. I need to look for duplicates in the key column, and where duplicate rows exist, merge each group into a new row, then write a new Excel file/sheet separating the merged rows from the non-merged rows.
I don't know if this sounds too complicated or not. I'm new here, and I did do some research, scouring the internet for scripts that do this, but no luck really... Here are the closest posts I found that may have something to do with what I want, though what I found is usually about people wanting to merge two different Excel files together:
http://pbpython.com/excel-file-combine.html
Looking to merge two Excel files by ID into one Excel file using Python 2.7
(I have more links but could only post two.)
Basically, I'm looking for duplicate rows and want to merge them together into a new Excel file or sheet, separating them from the non-duplicates and putting it all back together.
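For reference, a minimal sketch of this with pandas is below; the file names, the key column name ("ID"), and the merge rule (first non-null value per column) are all assumptions to adapt:

    import pandas as pd

    # Load the spreadsheet; "ID" is an assumed name for the key column.
    df = pd.read_excel("data.xlsx")

    # Split rows whose key appears more than once from the rest.
    dupe_mask = df.duplicated(subset="ID", keep=False)
    dupes = df[dupe_mask]
    uniques = df[~dupe_mask]

    # Merge each group of duplicate rows into a single row, keeping the
    # first non-null value per column (one possible merge rule).
    merged = dupes.groupby("ID", as_index=False).first()

    # Write merged and non-merged rows to separate sheets in a new file.
    with pd.ExcelWriter("merged_output.xlsx") as writer:
        merged.to_excel(writer, sheet_name="merged", index=False)
        uniques.to_excel(writer, sheet_name="non_merged", index=False)

Note that if two duplicate rows disagree in a column, .first() silently keeps the earlier value; a custom aggregation that joins the distinct values into one string would keep both instead.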
Related
I have a folder of 310 .csv files. Here is a sample of what the contents look like
I need to create a program that goes through all the files, lists each file name, then lists the top 4 values from the table and the x-value associated with each. Ideally this would all be saved to a text doc, but as long as it prints in a readable format, that works.
So, what is stopping you? Loop through the files, use pandas.read_csv to import each CSV file, and merge/join them all into one DataFrame. Use slicing to select the top 4 rows, and you can always print/visualize anything directly in a Jupyter Notebook. Exporting can be done using df.to_csv or any other method, and if you need a short introduction to pandas, look here.
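A rough sketch of that loop, where the folder path, the output file, and the column names ("x" and "y") are assumptions:

    import glob
    import pandas as pd

    with open("summary.txt", "w") as out:
        for path in sorted(glob.glob("data/*.csv")):
            df = pd.read_csv(path)
            # Take the 4 largest "y" values with their associated "x".
            top4 = df.nlargest(4, "y")[["x", "y"]]
            out.write(f"{path}\n{top4.to_string(index=False)}\n\n")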
Keep in mind that it is always a good idea to include a Minimal, Reproducible Example. Especially for a complicated merge operation between many DataFrames, this could help you a lot. However, there is no way around doing some research.
I'm new to the community and I only recently started to use Python and more specifically Pandas.
In the data set I have, I would like the columns to be the dates. For each date I would like a customer list that then breaks down into more specific row elements. Everything would be rolled up by order number, specifically a distinct count on order number, because sometimes a client purchases more than one item. In Excel I create a pivot table and process it by distinct order count. Then I sort each row element by the distinct count of the order number. I collapse each row down until I just have the client name; if I click to expand the cell, then I see each row element.
So my question: if I'm pulling in these huge data sets as a DataFrame, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the date as a datetime64 element. I've been trying to reshape the array so that the date becomes the columns and the rows are what I want, but so far I haven't had luck. I have tried to use pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
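For what it's worth, a pivot along those lines looks like this; the file name and the column names (client, order_date, order_number) are assumptions:

    import pandas as pd

    df = pd.read_excel("orders.xlsx", parse_dates=["order_date"])

    # Clients as rows, dates as columns, and a distinct count of
    # order numbers as values (like Excel's Distinct Count).
    pivot = pd.pivot_table(
        df,
        index="client",
        columns="order_date",
        values="order_number",
        aggfunc="nunique",
        fill_value=0,
    )
    pivot.to_excel("pivot_output.xlsx")

pandas alone won't give you the collapsible groups or colors, though; that part would be a separate styling pass, e.g. with openpyxl.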
Summary: overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to basically create a collapsible pivot table, with specific color parameters for the table as well, so that the current spreadsheet will look identical to the one I'm automating.
I really appreciate any help; as I said, I'm brand new to Pandas, so direction is key. I mainly want to know whether I'm onto the "best" way of dealing with the export to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
I have a worksheet with many duplicate rows in which only one column differs in a way that matters.
Is there a function that will put each of the differing streams into new columns, with the header being the date of the stream?
Essentially, I would like to have each song as a row and the day's streams as a column in that row. Please see the attached image for the end result I would like to achieve.
If this is possible in Python, that would be great as well, as I am pulling the data via a Python script using openpyxl.
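A minimal sketch of that reshape with pandas (the file names and the column names song, date, and streams are assumptions):

    import pandas as pd

    df = pd.read_excel("streams.xlsx")

    # One row per song, one column per date, stream counts as values;
    # pivot_table also collapses duplicate (song, date) rows by summing.
    wide = df.pivot_table(index="song", columns="date",
                          values="streams", aggfunc="sum")
    wide.to_excel("streams_by_date.xlsx")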
Thanks!
I'm pretty new to Pandas and Python, but I have a solid coding background. I've decided to pick this up because it will help me automate certain financial reports at work.
To give you some background on my issue: I'm taking a PDF and using Tabula to reformat it into a CSV file, which is working but giving me certain formatting issues. The reports come in as roughly 60-page PDF files, which I am exporting to CSV and then trying to manipulate in Python using Pandas.
The issue: when I reformat the data, I get a CSV file that looks something like this -
The issue here is that certain tables are shifting, and I think it is due to the number of pages and the multiple headings within them.
Would it be possible for me to reformat this data using Pandas, and basically create a set of rules for how it gets reformatted?
Basically, I would like to shift the rows that are misplaced back into their respective places based on something like blank spaces.
Is it possible for me to delete rows containing certain strings, i.e., to delete extra/unnecessary headers?
Can I somehow save the 'Total' data at the bottom by searching for the row with 'Total' and placing it somewhere else?
In essence, is there a way to partition this data by a set of commands (without specifying row numbers, because these change daily) and then reposition it accordingly so that I can manipulate the data however necessary?
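All three of those operations reduce to boolean masks in pandas; a minimal sketch, where the file name, the header text, and the "Total" label are assumptions:

    import pandas as pd

    df = pd.read_csv("report.csv")

    # Drop repeated page headers by matching their text (assumed label).
    header_mask = df.iloc[:, 0].astype(str).str.contains("Account Name", na=False)
    df = df[~header_mask]

    # Pull the "Total" row(s) out to reposition them separately.
    total_mask = df.iloc[:, 0].astype(str).str.startswith("Total")
    totals = df[total_mask]
    body = df[~total_mask]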
I have several thousand Excel documents. All of these documents are 95% the same in terms of column headings. However, since they are not 100% identical, I cannot simply merge them together and upload the result into a database without messing up the data.
Would anyone happen to have a library or an example that they've run into that would help?
If a large proportion of them are similar, and this is a one-off operation, it may be worth your while coding the solution for the majority and handling the other documents (or groups of them, if they are similar) separately. If you're using Python, you could simply build a dynamic query where the columns that are present in a given Excel sheet are built into the INSERT statements. Of course, this assumes that your database table allows NULLs or that a default value is present on the columns that aren't in a given document.
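A minimal sketch of that approach, using sqlite3 as a stand-in for the target database; the folder, the table name, and the assumption that column names can be trusted in the SQL all need adapting:

    import glob
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("combined.db")
    cur = conn.cursor()

    for path in glob.glob("docs/*.xlsx"):
        df = pd.read_excel(path)
        # Build the INSERT from whichever columns this file actually
        # has; absent columns fall back to the table's defaults/NULLs.
        cols = ", ".join(df.columns)
        placeholders = ", ".join("?" for _ in df.columns)
        sql = f"INSERT INTO combined ({cols}) VALUES ({placeholders})"
        cur.executemany(sql, df.itertuples(index=False, name=None))

    conn.commit()
    conn.close()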