How to split this data in a Python pandas dataframe?

This is my pandas DataFrame. In the index column I want to keep only the values after the double underscore (__) and remove the rest.

Use str.split with parameter n=1 to split only on the first separator (in case there are multiple __) and select the second element of each list:
df['index'].str.split('__', n=1).str[1]
Or use a list comprehension if there are no missing values and performance is important:
df['last'] = [x.split('__', 1)[1] for x in df['index']]
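A minimal runnable sketch with made-up values (the column name 'index' follows the question):
import pandas as pd

df = pd.DataFrame({'index': ['foo__bar', 'foo__bar__baz']})
# n=1 splits only on the first '__', so everything after it is kept together
df['last'] = df['index'].str.split('__', n=1).str[1]
print(df['last'].tolist())  # ['bar', 'bar__baz']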

df['index'].apply(lambda x: x.split('__')[-1]) will also do the trick; note that it keeps the part after the last __, whereas the n=1 approach keeps everything after the first __.

Related

Splitting column into multiple columns every other delimiter in python

I have a column that I am trying to split into multiple columns in Python. The data in the column looks like this:
1;899.618000;2;0.551582;7;93.643914;8;12.00000
I need to split this column on every other delimiter (;) into separate columns, so it needs to look like the below.
Col1          Col2        Col3         Col4
1;899.618000  2;0.551582  7;93.643914  8;12.00000
Assuming the data is consistently float-like, you can use a regex lookahead that splits only on a ; followed by an integer (a digits-only, non-float value) and another ;:
s = '1;899.618000;2;0.551582;7;93.643914;8;12.00000'
import re
re.split(r';(?=\d+;)', s)
output:
['1;899.618000', '2;0.551582', '7;93.643914', '8;12.00000']
This should also do the trick:
s = "1;899.618000;2;0.551582;7;93.643914;8;12.00000"
l1 = s.split(";")
l2 = [l1[i] + ';' + l1[i + 1] for i in range(0, len(l1), 2)]
The value of l2 will be
['1;899.618000', '2;0.551582', '7;93.643914', '8;12.00000']
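Since the question concerns a DataFrame column, here is a minimal sketch that applies the regex split row-wise and expands the pieces into columns (the column name 'col' and the Col1..Col4 headers are assumptions):
import re

import pandas as pd

df = pd.DataFrame({'col': ['1;899.618000;2;0.551582;7;93.643914;8;12.00000']})
# split each value on a ';' only when an integer and another ';' follow it
parts = df['col'].apply(lambda s: re.split(r';(?=\d+;)', s))
out = pd.DataFrame(parts.tolist(), columns=['Col1', 'Col2', 'Col3', 'Col4'])
print(out)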

How to Remove Duplicate Values within a String inside a Python DataFrame Cell

I have the following DataFrame.
If you look at the first row, the string consists of duplicate values, e.g. GM0001, GMM003 and so on.
Is it possible to remove those duplicates within each cell in the SITE_ID column?
You can turn the tuples into sets:
df['SITE_ID_UNIQUE'] = df.SITE_ID.apply(set)
Vinura Perera's answer works just fine... provided you are okay with curly braces (sets) instead of tuples. It also adds another column to your DataFrame. If you want tuple-style output and don't want to create another column, try this (note the result is a string formatted like a tuple, not an actual tuple):
df['SITE'] = [str(set(i)).replace('{', '(').replace('}', ')') for i in df['SITE']]
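A minimal runnable sketch of both approaches, with made-up SITE_ID tuples (the column contents are assumptions):
import pandas as pd

df = pd.DataFrame({'SITE_ID': [('GM0001', 'GMM003', 'GM0001'), ('GMM002', 'GMM002')]})
# new column with set-based de-duplication (note: sets do not preserve order)
df['SITE_ID_UNIQUE'] = df['SITE_ID'].apply(set)
# in-place variant: a string formatted with parentheses instead of braces
df['SITE_ID'] = [str(set(i)).replace('{', '(').replace('}', ')') for i in df['SITE_ID']]
print(df)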

Performing a JSON operation on a dataframe column

I have a data frame where one of the columns holds strings, each of which can be converted separately to a dictionary with json.loads(string).
I'd like to perform json.loads() on the entire column at once, turning the column of strings into a column of dictionaries.
Is this possible?
You can use apply or a list comprehension:
df['col'] = df['col'].apply(pd.io.json.loads)
df['col'] = [pd.io.json.loads(x) for x in df['col']]
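Note that newer pandas versions deprecate pd.io.json.loads in favor of the standard library; json.loads can be applied the same way:
import json
df['col'] = df['col'].apply(json.loads)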
Another, more general solution (it also accepts Python-literal strings that are not valid JSON):
import ast
df['col'] = df['col'].apply(ast.literal_eval)
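For example, ast.literal_eval also accepts single-quoted dictionary strings, which json.loads would reject (a minimal sketch with made-up data):
import ast

import pandas as pd

df = pd.DataFrame({'col': ["{'a': 1}", "{'b': 2}"]})
df['col'] = df['col'].apply(ast.literal_eval)
print(type(df['col'].iloc[0]))  # <class 'dict'>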

Pandas .isin() for list of values in each row of a column

I have a small problem: I have a column in my DataFrame which has multiple rows, and each row holds one or more values starting with the letter 'M' followed by 3 digits. If there is more than one value, they are separated by a comma.
I would like to print out a view of the DataFrame, featuring only rows where that one column holds any of the values I specify (e.g. any item from the list ['M111', 'M222']).
I have started to build my boolean mask in the following way:
df[df['Column'].apply(lambda x: x.split(', ').isin(['M111', 'M222']))]
In my mind, .apply() with .split() should first convert the 'Column' values in each row to lists of one or more items, and then .isin() should confirm whether any of the items in each row's list are in the list of specified values ['M111', 'M222'].
In practice, however, instead of getting the desired view of the DataFrame, I get an error:
TypeError: unhashable type: 'list'
What am I doing wrong?
Kind regards,
Greem
I think you need:
df2 = df[df['Column'].str.contains('|'.join(['M111', 'M222']))]
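Keep in mind that str.contains interprets its pattern as a regex. A sketch building the same filter from a list, with re.escape guarding against metacharacters in the values (the sample data is an assumption):
import re

import pandas as pd

df = pd.DataFrame({'Column': ['M111, M000', 'M333, M444']})
targets = ['M111', 'M222']
pattern = '|'.join(map(re.escape, targets))
df2 = df[df['Column'].str.contains(pattern)]
print(df2)  # keeps only the row containing 'M111'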
You can only access the isin() method on a pandas object, but split() returns a plain Python list. Wrapping the result of split() in a Series will work:
import pandas as pd

# sample data
data = {'Column':['M111, M000','M333, M444']}
df = pd.DataFrame(data)
print(df)
       Column
0  M111, M000
1  M333, M444
Now wrap split() in a Series.
Note that isin() will return a Series of boolean values, one for each element coming out of split(). You want to know whether any of the items in each row are in the list of specified values, so add any() to your apply function.
df[df['Column'].apply(lambda x: pd.Series(x.split(', ')).isin(['M111', 'M222']).any())]
Output:
       Column
0  M111, M000
As others have pointed out, there are simpler ways to go about achieving your end goal. But this is how to resolve the specific issue you're encountering with isin().
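For instance, one simpler route (a sketch, not from the original answers) is a per-row set intersection, which skips building a Series for every row:
df[df['Column'].apply(lambda x: bool(set(x.split(', ')) & {'M111', 'M222'}))]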

Keep rows from a dataframe whose index name is NOT in a given list

So, I have a list of tuples and a multi-index DataFrame. I want to find the rows of the DataFrame whose indices are NOT included in the list of tuples, and create a new DataFrame from these rows. Any help? Thanks!
You can use isin with a negation to explicitly filter your DataFrame:
new_df = df[~df.index.isin(list_of_tuples)]
Alternatively, use drop to remove the tuples you don't want to be included in the new DataFrame:
new_df = df.drop(list_of_tuples)
From a couple of simple tests, using isin appears to be faster, although drop is a bit more readable.
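A minimal runnable sketch with a made-up two-level index (index names and values are assumptions):
import pandas as pd

df = pd.DataFrame(
    {'val': [1, 2, 3]},
    index=pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)]),
)
list_of_tuples = [('a', 2)]
new_df = df[~df.index.isin(list_of_tuples)]  # keeps ('a', 1) and ('b', 1)
print(new_df)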
