Removing rows whose values have an invalid format - Python

I am currently using Python and have a dataframe that includes a column of part numbers.
These part numbers follow various patterns, e.g. 500-1222-33, 48L48, etc.
However, I want to remove rows whose values have the following format: e.g. 06/06/3582.
Is there a way to remove rows with these value patterns from the dataframe?
Thanks in advance.
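A minimal sketch of one approach, assuming the column is named PartNumber (a hypothetical name) and that the rows to drop all match a dd/dd/dddd date-like pattern:
import pandas as pd

# Toy data; the column name PartNumber is an assumption.
df = pd.DataFrame({"PartNumber": ["500-1222-33", "48L48", "06/06/3582"]})

# Flag rows whose value looks like a dd/dd/dddd date and keep the rest.
date_like = df["PartNumber"].str.match(r"\d{2}/\d{2}/\d{4}$", na=False)
df = df[~date_like]
print(df)  # 500-1222-33 and 48L48 remain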

Pandas df.loc with regex

I'm working with a data set consisting of several CSV files of nearly the same form. Each CSV describes a particular date and labels the data by state/province. However, the format of one of the column headers was changed from Province/State to Province_State, so all CSVs created before a certain date use the first format and all CSVs created after that date use the second.
I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:
daily_data.loc[daily_data[areaLabel] == location].sum()
where daily_data is the dataframe containing the CSV data, location is the name of the state I'm looking for, and areaLabel is a variable storing either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check by, e.g., matching on a regular expression like Province(/|_)State, but I'm having a lot of trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that makes the code more elegant rather than less)? If so, I'd appreciate it if someone could point me in the right direction.
Use filter to get the columns that match your regex:
>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'
Then use this to select only rows that match your location:
df[df[df.filter(regex="Province(/|_)State").columns[0]]==location].sum()
This however assumes that there are no other columns that would match the regex.
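For readability, the matched column name can also be stored in a variable first (a sketch, under the same assumption that exactly one column matches the regex):
# Resolve whichever header variant this file uses, then filter rows on it.
area_col = daily_data.filter(regex="Province(/|_)State").columns[0]
state_total = daily_data[daily_data[area_col] == location].sum()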

Python data frame formatting string

I've been working on a dataframe-filtering project. I have a table in Excel, which I converted to a UTF-8 CSV file. In Excel I formatted all of my columns as numbers with two digits after the decimal point.
However, as you can see in the figure, some of my columns are different. The default should be xx.xx, but some columns appear as xx.xxxxx in the dataframe. I can filter the xx.xx columns properly, but the other columns cause problems. I tried filtering them as xx.xxxxx, but that didn't work either. How can I get rid of this problem?
In PyCharm's View Data Frame tool I can format them, but that only changes the display. What should I do about this?
You can use round to choose the number of digits you want to keep. This way, you can have an equal number of decimals in all columns.
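A minimal sketch, assuming a dataframe with uneven decimal precision (made-up values):
import pandas as pd

df = pd.DataFrame({"a": [1.23456, 2.34567], "b": [3.12, 4.56]})

# Round every numeric column to two decimal places so all columns filter consistently.
df = df.round(2)
print(df)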

Splitting column of dataframe based on text characters in cells

I imported a .csv file with a single column of data into a dataframe that I am trying to clean up by splitting the column based on various string occurrences within the cells. I've tried numerous ways to split the column, but can't seem to get it to work. My latest attempt was the following:
df.loc[:,'DataCol'] = df.DataCol.str.split(pat=':\n',expand=True)
df
The result is a dataframe that is still one column and completely unchanged. What am I doing wrong? This is my first time doing anything like this, so please forgive the simple question.
df.loc creates a copy of the column you've selected; try replacing the expression below with df['DataCol'], which references the actual column in the original dataframe:
df.loc[:,'DataCol']
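A sketch of one corrected version; note that with expand=True the split returns a dataframe of pieces rather than a single column, so joining those pieces back on (as done here) is an assumption about the desired result:
# expand=True spreads the split pieces into separate numbered columns.
parts = df['DataCol'].str.split(pat=':\n', expand=True)

# Attach the split pieces alongside the original column.
df = df.join(parts)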

How to use DataFrame.isin without the constraint of having to match both index and value?

So, I have two files, one with 6 million entries and the other with around 5 million entries. I want to compare the values of a particular column in both dataframes. This is the code that I have used:
print(df1['Col1'].isin(df2['col3']).value_counts())
This is essential for me, as I want to see the number of True (same) and False (different) entries. Around 95% of the entries come back as True, but some 5% come back as False. I extracted this data using to_csv and compared the columns using vimdiff, and they are all identical, so why is the code labelling them as False (different)? Is there a better and more foolproof method?
Note: I have checked for whitespace in the columns as well. There is no whitespace.
PS: The pandas isin documentation states that both index and value have to match. Since I have more entries in one file, the index does not match for those entries; how do I remove that constraint?
First, convert the column you pass to the isin() method into a list.
Then apply the filter to your df1 dataframe so that you take the value counts from the same column you filtered on.
From your example:
print(df1[df1['Col1'].isin(df2['col3'].values.tolist())]['Col1'].value_counts())
Try running that again.
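A small usage sketch with toy data (made-up values) illustrating that passing a plain list makes the check purely value-based, regardless of index:
import pandas as pd

df1 = pd.DataFrame({"Col1": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"col3": ["b", "d"]}, index=[10, 11])  # indexes deliberately differ

# Membership is checked on values only, never on index alignment.
print(df1[df1["Col1"].isin(df2["col3"].values.tolist())]["Col1"].value_counts())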

How to merge rows using pandas if they have duplicate values?

I have a specific case with my data for which I'm unable to find an answer in any documentation or on Stack Overflow.
What I'm trying to do is merge duplicates based on the MPN column (and not the Vehicle column).
There will be duplicate MPNs in lots of rows, as shown in the first image.
I obviously want to remove duplicate rows that share the same MPN, but MERGE the Category values from the three rows shown in Image 1 into one cell, separated by colons, as shown in Image 2; that would be my desired result after coding.
What I'm asking for: to be able to merge and remove duplicates based on rows that contain a duplicate MPN, merging them into ONE row while retaining the categories separated by colons.
Look at my before and after images to understand more clearly.
I'm also using Python 3.7 to code this, reading from a comma-separated csv file.
Before:
After duplicates have merged:
How do I solve the problem?
Assuming df holds your csv data:
First, group by the common columns (Vehicle and MPN) and build a colon-separated string from the Category column.
df['Category'] = df.groupby(['Vehicle', 'MPN'])['Category'].transform(lambda x: ':'.join(x))
Second, remove the duplicates (note that drop_duplicates returns a new dataframe, so assign the result back).
df = df.drop_duplicates()
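A self-contained sketch with made-up Vehicle/MPN/Category values showing the before and after:
import pandas as pd

df = pd.DataFrame({
    "Vehicle": ["Golf", "Golf", "Golf"],
    "MPN": ["ABC123", "ABC123", "ABC123"],
    "Category": ["Brakes", "Discs", "Pads"],
})

# Merge each group's categories into one colon-separated string...
df["Category"] = df.groupby(["Vehicle", "MPN"])["Category"].transform(lambda x: ":".join(x))

# ...then collapse the now-identical rows into one.
df = df.drop_duplicates()
print(df)  # one row: Golf  ABC123  Brakes:Discs:Pads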
