I have a specific case with my data that I'm unable to find an answer to in any documentation or on Stack Overflow.
What I'm trying to do is merge duplicates based on the 'MPN' column (and not the Vehicle column).
There will be duplicate MPNs across lots of rows, as shown in the first image.
I want to remove duplicate rows that share the same MPN, but MERGE the Category values from those rows (as shown in Image 1) into one cell separated by colons (as shown in Image 2), which is my desired result.
What I'm asking for: to merge rows that contain a duplicate MPN into ONE row, removing the duplicates while retaining the categories separated by a colon.
Look at my before and after images to understand more clearly.
I'm using Python 3.7, reading the data from a comma-separated CSV file.
Before:
After duplicates have been merged:
How do I solve the problem?
Assuming df holds your CSV data.
First, group by the MPN column (MPN alone, since you don't want to match on Vehicle) and build a colon-separated string from the Category column:
df['Category'] = df.groupby('MPN')['Category'].transform(lambda x: ':'.join(x))
Second, remove the duplicate rows (note that drop_duplicates returns a new dataframe, so assign the result):
df = df.drop_duplicates(subset='MPN')
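A minimal end-to-end sketch of the same steps, with made-up Vehicle/MPN/Category values standing in for the screenshots:

import pandas as pd

df = pd.DataFrame({
    'Vehicle': ['Golf', 'Golf', 'Golf'],
    'MPN': ['ABC123', 'ABC123', 'ABC123'],
    'Category': ['Brakes', 'Discs', 'Pads'],
})

# Join every Category that shares an MPN into one colon-separated cell
df['Category'] = df.groupby('MPN')['Category'].transform(lambda x: ':'.join(x))
# Keep only the first row for each MPN
df = df.drop_duplicates(subset='MPN')
print(df)  # one row, Category == 'Brakes:Discs:Pads'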
Below is the code where 5 dataframes are generated, and I want to combine them all into one. Since they have different column headers, I think appending them to the list is not retaining the header names and is producing numbers instead.
Is there any other solution to combine the dataframes while keeping the header names as they are?
Thanks in advance!!
frames = []
i = 0
while i < 5:
    df = pytrend.interest_over_time()
    frames.append(df)
    i = i + 1
df_concat = pd.concat(frames, axis=1)
Do you have a common column in the dataframes that you can merge on? In that case, use the DataFrame merge function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
I've had to do this recently with two dataframes I had, and I merged on the date column.
Are you trying to add additional columns, or append each dataframe on top of each other?
https://www.datacamp.com/community/tutorials/joining-dataframes-pandas
This link will give you an overview of the different functions you might need to use.
You can also rename the columns, if they do contain the same sort of data. Without an example of the dataframe it's tricky to know.
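A hedged sketch of both options; the frame and column names here (df_a, df_b, 'date') are made up for illustration:

import pandas as pd

df_a = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'], 'hits_a': [10, 20]})
df_b = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'], 'hits_b': [5, 15]})

# Option 1: merge on the shared column, keeping both value columns
merged = df_a.merge(df_b, on='date')

# Option 2: rename the columns so the headers line up, then stack the frames
stacked = pd.concat(
    [df_a.rename(columns={'hits_a': 'hits'}),
     df_b.rename(columns={'hits_b': 'hits'})],
    ignore_index=True,
)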
I imported a .csv file with a single column of data into a dataframe that I am trying to clean up by splitting the column based on various string occurrences within the cells. I've tried numerous means to split the column, but can't seem to get it to work. My latest attempt was using the following:
df.loc[:,'DataCol'] = df.DataCol.str.split(pat=':\n',expand=True)
df
The result is a dataframe that is still one column and completely unchanged. What am I doing wrong? This is my first time doing anything like this so please forgive the simple question.
The issue is unlikely to be df.loc making a copy. With expand=True, str.split returns a new DataFrame with one column per split piece; if the pattern ':\n' never occurs in a cell, you simply get back a single column identical to the original, which would explain the unchanged result. Also, a multi-column result can't be assigned into the single 'DataCol' column; assign it to new columns instead (the names below are just examples), and double-check that ':\n' actually appears in your data:
df[['Label', 'Value']] = df['DataCol'].str.split(pat=':\n', expand=True)
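A minimal runnable sketch with made-up data, assuming each cell really does contain a ':' followed by a newline:

import pandas as pd

df = pd.DataFrame({'DataCol': ['name:\nAlice', 'name:\nBob']})
# expand=True splits into a DataFrame with one column per piece
df[['Label', 'Value']] = df['DataCol'].str.split(pat=':\n', expand=True)
print(df)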
So, I have two files, one with 6 million entries and the other with around 5 million entries. I want to compare the values of a particular column in both dataframes. This is the code that I have used:
print(df1['Col1'].isin(df2['col3']).value_counts())
This is essential for me, as I want to see the number of True (same) and False (different) entries. Around 95% of the entries come back as True, but some 5% come back as False. I extracted that data using to_csv and compared the columns with vimdiff, and they are all identical, so why is the code labelling them as False (different)? Is there a better, more foolproof method?
Note: I have checked for whitespace in the columns as well. There is no whitespace.
PS: The pandas isin documentation states that both index and value have to match. Since I have more entries in one file, the index does not match for those entries; how do I remove that constraint?
First, convert the column you pass to isin() into a list; a plain list is matched purely by value, with no index involved.
Then filter df1 with the resulting mask, because you want the value counts on the same column you filtered.
From your example:
print(df1[df1['Col1'].isin(df2['col3'].values.tolist())]['Col1'].value_counts())
Try running that again.
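A small self-contained sketch of the same idea, with toy data in the column names from the snippet above:

import pandas as pd

df1 = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'col3': ['B', 'D', 'E']})

# Membership is tested by value only, so differing lengths and indexes don't matter
mask = df1['Col1'].isin(df2['col3'].tolist())
print(mask.value_counts())  # True: 2 (B and D), False: 2 (A and C)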
I have been using pandas but am open to all suggestions. I'm not an expert at scripting and am at a complete loss. My goal is the following:
1. Merge multiple CSV files. I was able to do this in pandas and have a dataframe with the merged dataset.
Screenshot of how the merged dataset looks
2. Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicated: the repeated column names end with .1, .2, etc., which I am guessing the concat adds. Another problem is that some columns share a column name but hold different datasets. I have been using the first row as the index, since it's always the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
3. Delete certain columns, such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
4. Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this (see the sketch at the end of this question).
5. Insert a new column, write an '=TEXT(B1,0)' formula, apply it to the whole column (the formula changing to B2, B3, etc.), then copy the column and paste as values. I was able to do this in openpyxl, although I was having trouble and could not verify the final output due to Excel problems.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()  # select the helper column holding the =TEXT() formulas
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)  # paste back over itself as plain values
Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
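A hedged pandas sketch of steps 2 to 5, with made-up file names and the assumption that the repeated columns really are named GEO.1, GEO.2, etc.; treat it as a starting point, not a tested solution:

import pandas as pd

df = pd.read_csv('merged.csv')  # hypothetical path to the merged dataset

# Step 2: drop the repeated GEO columns that concat renamed to GEO.1, GEO.2, ...
df = df.loc[:, ~df.columns.str.match(r'GEO\.\d+$')]

# Step 3: drop columns whose header starts with 'Margin'
df = df.loc[:, ~df.columns.str.startswith('Margin')]

# Step 4: replace spaces, ':' and ';' with underscores in the first row
df.iloc[0] = df.iloc[0].astype(str).str.replace(r'[ :;]', '_', regex=True)

# Step 5: the pandas analogue of =TEXT(B1,0) is converting the column to strings
df['B_as_text'] = df.iloc[:, 1].astype(str)

df.to_csv('cleaned.csv', index=False)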
I am currently using Python and I have a dataframe that includes a column of part numbers.
These part numbers come in various patterns, e.g. 500-1222-33, 48L48, etc.
However, I want to remove rows whose value has the following format: e.g. 06/06/3582.
Is there a way to remove the rows with these value-patterns from the dataframe?
Thanks in advance.
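A minimal sketch using a regex filter, assuming the column is named 'PartNumber' and the unwanted values always look like digits separated by slashes:

import pandas as pd

df = pd.DataFrame({'PartNumber': ['500-1222-33', '48L48', '06/06/3582']})

# Keep only the rows that do NOT match the dd/dd/dddd pattern
date_like = df['PartNumber'].str.match(r'\d{2}/\d{2}/\d{4}$')
df = df[~date_like]
print(df)  # 500-1222-33 and 48L48 remain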