Vaex: Is there a way to split a single column into multiple columns - python

I have been trying to find a way to split text data (the separator is a space) in a single column into multiple columns. I can do it with Pandas using the following code, but I would like to do the same with Vaex.
I was looking at the Vaex API documentation, but I can't find an rsplit-equivalent method there:
https://vaex.readthedocs.io/en/latest/api.html
df_data = df_data.iloc[:,0].apply(lambda x: pd.Series(x.rsplit(" ")))
I have also referred to the page below, which asks a similar question, and tried to run the same code, but in my environment I get this error: Error evaluating: ValueError('No memory tracker found with name default').
vaex extract one column of str.split()
df = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
df_vaex = vaex.from_pandas(df)
df_vaex["ticker"].str.split(" ").apply(lambda x: x[-1])

Are you using the latest version of vaex?
I just tried out your code example and it works fine.

After I restarted my laptop, the following split method worked and I got the result I expected.
df_data = df_data[:,0].str.split(' ').apply(lambda x: x[3])
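To get back to the original question of producing multiple columns, here is a minimal sketch built on the same vaex string API shown above; the output column names and token positions are made up for this example:
import pandas as pd
import vaex
pdf = pd.DataFrame({'ticker': ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
df = vaex.from_pandas(pdf)
parts = df['ticker'].str.split(' ')
# One virtual column per token (hypothetical names):
df['symbol'] = parts.apply(lambda x: x[0])
df['date'] = parts.apply(lambda x: x[1])
df['strike'] = parts.apply(lambda x: x[2])
print(df)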

Related

pandas df.apply() not working with html.unescape()

I'm trying to decode HTML characters within a pandas DataFrame.
I don't know why, but my apply function won't work.
# requirements
import html
import pandas as pd
# This code works fine.
df = df.apply(lambda x: x + "TESTSTRING")
print(df) # "TESTSTRING" is appended to all values.
# This code also works fine; html.unescape() behaves as expected.
fn = lambda x: html.unescape(x)
s = "Something wrong with &lt;b&gt;E&amp;S&lt;/b&gt;"  # renamed from str to avoid shadowing the builtin
print(fn(s))  # returns "Something wrong with <b>E&S</b>"
# However, the code below doesn't work: the "&amp;" entities within the values don't get decoded.
df2 = df.apply(fn)
print(df2) # The html characters aren't decoded!
It's really frustrating that the apply function and html.unescape() work well separately, but I don't know why they don't work together.
I've also tried axis=1.
I'd really appreciate your help. Thanks in advance.
The problem is that html.unescape() is not vectorized, i.e. it accepts only a single string.
If your df is not really large, using applymap should still be sufficiently fast:
df2 = df.applymap(lambda x: html.unescape(x))
print(df2)
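A note on why the original df.apply(fn) silently did nothing: DataFrame.apply passes each whole column (a Series) to fn, and html.unescape begins with an early-exit check ('&' not in s); on a Series, in tests the index rather than the values, so the column is likely returned unchanged. A minimal element-wise sketch (the column name is made up):
import html
import pandas as pd
df = pd.DataFrame({'text': ['E&amp;S', '&lt;b&gt;bold&lt;/b&gt;']})
# Series.apply calls html.unescape once per value, i.e. once per string:
df['text'] = df['text'].apply(html.unescape)
print(df)  # E&S, <b>bold</b>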

Filtering in pandas: excluding rows that contain part of a string [duplicate]

I've done some searching and can't figure out how to filter a dataframe by
df["col"].str.contains(word)
However, I'm wondering if there is a way to do the reverse: filter a DataFrame by that set's complement, e.g. something to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame method?
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by the right-hand side.
contains also accepts a regular expression...
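A quick demonstration with made-up data:
import pandas as pd
df = pd.DataFrame({'col': ['foo bar', 'baz', 'barista']})
word = 'bar'
# ~ inverts the boolean mask, keeping rows that do NOT contain 'bar':
print(df[~df['col'].str.contains(word)])  # only the 'baz' row survives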
If the above throws a ValueError or TypeError, it is likely because you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
I was having trouble with the not (~) symbol as well, so here's another way from another StackOverflow thread:
df[df["col"].str.contains('this|that')==False]
You can use apply and a lambda:
df[df["col"].apply(lambda x: word not in x)]
Or, if you want to define a more complex rule, you can use and:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
The basic answers are already posted; I am adding a framework for finding multiple words and negating those rows from the DataFrame.
Here values_to_remove is the list of patterns to search for, df is the DataFrame, and 'column_a' is a column name in df:
values_to_remove = ['word1','word2','word3','word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
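One caveat on the snippet above: str.contains interprets the joined pattern as a regular expression, so if any entry in values_to_remove might contain regex metacharacters, it may be safer to escape them first; a sketch building on the variables defined above:
import re
# re.escape guards against metacharacters; na=False also guards against NaN values.
pattern = '|'.join(re.escape(v) for v in values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False, na=False)]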
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df
    first  second   third
0  myword  myword     NaN
1  myword     NaN  myword
2  myword  myword     NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
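Alternatively, the na=False argument shown in the earlier answer sidesteps the TypeError without dropping or filling anything, by treating NaN as a non-match:
df[~df['second'].str.contains('myword', na=False)]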
To negate your query use ~. Using query has the advantage of returning the valid observations of df directly:
df.query('~col.str.contains("word").values')
In addition to nanselm2's answer, you can use 0 instead of False:
df["col"].str.contains(word)==0
Somehow .contains didn't work for me, but when I tried .isin, as mentioned by @kenan in this answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), it worked. Further, if you want to look at the entire DataFrame and remove the rows containing a specific word (or set of words), just use the loop below:
for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]
Just remove the ~ to get the DataFrame rows that do contain the word.
To complement the above, if someone wants to remove all the rows containing strings, one could do:
df_new = df[~df['col_name'].apply(lambda x: isinstance(x, str))]

How can I split out this list containing a dictionary into separate columns?

The top table is what I have and the bottom is what I want. I'm doing this in a Pandas dataframe. Any help would be appreciated.
Thanks!
It would have been nice if you had provided a code snippet, since we are unable to easily test your case.
The following lines should do the job:
df['label'] = df['sentiment'].apply(lambda x: x[0]['label'])
df['score'] = df['sentiment'].apply(lambda x: x[0]['score'])
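Since the question only had screenshots, the lines above assume each cell in the sentiment column holds a one-element list containing a dict. A made-up reproduction of that shape:
import pandas as pd
# Hypothetical data matching the structure the answer assumes
df = pd.DataFrame({'sentiment': [[{'label': 'POSITIVE', 'score': 0.98}], [{'label': 'NEGATIVE', 'score': 0.87}]]})
df['label'] = df['sentiment'].apply(lambda x: x[0]['label'])
df['score'] = df['sentiment'].apply(lambda x: x[0]['score'])
print(df[['label', 'score']])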

Cleaning data frames with rogue elements using split()

Given the following data in an Excel sheet (read in as a DataFrame):
Name  Number        Date
AA    '9988779911'  '01-JAN-18'
'BB'  '8779912044'  '01-FEB-18'
I have used the following code to clean the DataFrame and remove the unnecessary apostrophes:
for name in list(df):
    df[name] = df[name].str.split("'").str[1]
And I want the following output :
Name  Number      Date
AA    9988779911  01-JAN-18
BB    8779912044  01-FEB-18
I am getting the following error :
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Thanks in advance for your help. :)
Try this:
for name in list(df):
    df[name] = df[name].str.replace("'", "")
This replaces each ' with an empty string.
A simpler approach:
df.applymap(lambda x: x.replace("'",""))
The strip function is probably the shortest way here; the other answers are elegant too.
df[name] = df[name].str.strip("'")
Moshevi has said the same in one of the comments.
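Putting the suggestions together, a minimal sketch that reconstructs the question's data by hand; the astype(str) call is one plausible guard against the AttributeError, which suggests a non-string column sneaked in:
import pandas as pd
df = pd.DataFrame({'Name': ['AA', "'BB'"], 'Number': ["'9988779911'", "'8779912044'"], 'Date': ["'01-JAN-18'", "'01-FEB-18'"]})
for name in df.columns:
    # Coerce to str first so the .str accessor works even with mixed dtypes
    df[name] = df[name].astype(str).str.strip("'")
print(df)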

Pandas df.apply does not modify DataFrame

I am just starting with pandas, so please forgive me if this is something stupid.
I am trying to apply a function to a column, but it's not working and I don't see any errors either.
capitalizer = lambda x: x.upper()
for df in pd.read_csv(downloaded_file, chunksize=2, compression='gzip', low_memory=False):
    df['level1'].apply(capitalizer)
    print(df)
    exit(1)
This print shows the level1 column values unchanged from the original csv, i.e. not uppercased. Am I missing something here?
Thanks
apply is not an in-place function - it does not modify values in the original object, so you need to assign the result back:
df['level1'] = df['level1'].apply(capitalizer)
Alternatively, you can use str.upper; it should be much faster:
df['level1'] = df['level1'].str.upper()
df['level1'] = list(map(lambda x: x.upper(), df['level1']))
You can use the code above to make your column uppercase; wrapping map in list materializes the iterator before assignment.
