In an itertuples loop, I read some row values, converted them into a Series, cast the values to strings with astype, and used concat to append them to dff, as shown below.
In [24]: dff
Out[24]:
SRD Aspectno
0 9450 [9450.01, 9450.02]
1 9880 [9880.01, 9880.02, 9880.03]
When I apply the following command, it strips out all the data. I have used the split command before; it may have something to do with the square brackets, but str.strip or str(0) also removes all the data.
In [25]: splitdff = dff['Aspectno'].str.split(',', expand=True)
In [26]: splitdff
Out[26]:
0
0 NaN
1 NaN
What am I doing wrong?
Also, when converting the data after reading the rows, how do I get the data in row 0 to shift to the left, i.e., [9450.01, 9450.02] shifted over by one column?
It looks like you're trying to split a list on a comma, but .str.split is a method intended for strings. Try this to break the values out into their own columns:
import pandas as pd
...
dff['Aspectno'].apply(pd.Series)
It will give you a DataFrame with the entries in columns. The lists have different lengths, so there will be as many columns as the longest list. If you know that length, you can assign names directly:
dff[['col1','col2','col3']] = dff['Aspectno'].apply(pd.Series)
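If the length of the longest list isn't known up front, a variant (a sketch of my own, not part of the original answer) is to expand first and name the columns from the resulting width:

# Expand the lists into columns, then name the columns after the fact.
expanded = dff['Aspectno'].apply(pd.Series)
expanded.columns = [f'col{i + 1}' for i in range(expanded.shape[1])]
dff = dff.join(expanded)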
The code dff['Aspectno'] selects the Aspectno series, whose values are lists rather than strings, so splitting on the character ',' does nothing useful: the .str accessor finds no strings to operate on and returns NaN.
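A minimal repro of that behaviour (my own illustration): apply .str.split to a Series of lists and to a Series of strings and compare:

import pandas as pd

s_lists = pd.Series([[9450.01, 9450.02]])     # elements are lists
s_strings = pd.Series(['9450.01,9450.02'])    # elements are strings

print(s_lists.str.split(',', expand=True))    # a single column of NaN
print(s_strings.str.split(',', expand=True))  # two columns: '9450.01', '9450.02'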
I have a CSV file in the following format:

Columns one    Column two
Key1           Value1,Value2,value3
Key2           value5
I can easily use a list and .isin to filter the DataFrame as follows:
list_keep = ['Value5']
dataframe[dataframe.isin(list_keep).any(axis=1)]
That gives me the second row. But if a cell holds multiple values (like the first row in the example table above, with Value1,Value2,value3), then the isin filter no longer works for single values like Value1. This makes sense, since the quoting turns them into a single string, which I missed because spreadsheets hide the quotes.
For example, when I do this:
list_keep = ['Value1']
dataframe[dataframe.isin(list_keep).any(axis=1)]
Then nothing is returned, because the first row holds Value1,Value2,value3 as one single string (i.e., the first row, which is the desired output, is not produced).
IMPORTANT NOTE: I want to query all columns not just one.
So, how can I set this code up such I can query multiple elements with cells?
Is there a way to do this in pandas?
You can stack the dataframe to reshape it, then split and explode the strings, use isin to test each value for membership in list_keep, and finally groupby on level=0 and reduce with any to build a boolean mask:
mask = df.stack().str.split(',').explode().isin(list_keep).groupby(level=0).any()
An alternative approach with applymap and set operations:
mask = df.applymap(lambda s: not set(s.split(',')).isdisjoint(list_keep)).any(axis=1)
>>> df[mask]
Columns one Column two
0 Key1 Value1,Value2,value3
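To see what each step in the first chain produces, here is the same pipeline broken into its intermediates on the example data (an illustration of my own):

import pandas as pd

df = pd.DataFrame({'Columns one': ['Key1', 'Key2'],
                   'Column two': ['Value1,Value2,value3', 'value5']})
list_keep = ['Value1']

stacked = df.stack()                         # long Series, MultiIndex (row, column)
exploded = stacked.str.split(',').explode()  # one cell value per row
hits = exploded.isin(list_keep)              # True where a value is in list_keep
mask = hits.groupby(level=0).any()           # collapse back to one flag per row
print(df[mask])                              # keeps only the Key1 row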
Below is my data in SQL Server: [image]
After reading the data in Python, it becomes like this: [image]
I use the below code to split the value to multiple columns
# 1. Split the single array column into multiple columns based on '\t'
df[['v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','v11','v12','v13','v14','v15',
    'v16','v17','v18','v19','v20','v21','v22','v23','v24','v25','v26','v27','v28',
    'v29','v30','v31','v32','v33','v34','v35','v36','v37','v38','v39','v40','v41']] = df['_VALUE'].str.split(pat="\t", expand=True)
# 2. To remove the '\r\n' from the last column
df['v41'] = df['v41'].replace(r'\s+|\\n', ' ', regex=True)
But in some data sets the array has more values, e.g. 100 columns, and the above code gets very long: I would have to write v1 through v100. Is there any simpler way to do this?
You can replace the hardcoded list of column names with one that is generated for you, like this:
df[[f'v{x}' for x in range(1, 101)]] = df['_VALUE'].str.split(pat="\t", expand=True)
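If the width varies between data sets, a further variant (my sketch, assuming the same df and _VALUE column) derives the column count from the split itself:

# Let str.split decide the width, then name and attach the columns.
parts = df['_VALUE'].str.split(pat='\t', expand=True)
parts.columns = [f'v{i + 1}' for i in range(parts.shape[1])]
df = df.join(parts)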
So I'm new to pandas and this is my first notebook. I needed to join some columns of my dataframe, and after that I wanted to separate the values to make them easier to visualize.
To join the columns I used
df['Q7'] = df[['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']].apply(lambda x: '_'.join(x.dropna().astype(str)), axis=1)
and it did well, but I still needed to separate the values, and for that I used explode(), like:
df.Q7 = df.Q7.str.split('_').explode()
and that gave me some empty cells in the dataframe:
[image: resulting dataframe]
When I try to visualize the values, they just come up empty, like:
[image: sum of empty cells]
What could I do to keep these empty cells out of the viz?
Edit 1: By the way, they do not appear as null or NaN cells when I run df.isnull().sum() or df.isna().sum().
c = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4',
     'Q7_Part_5', 'Q7_Part_6', 'Q7_OTHER']
df['Q7'] = df[c].apply(lambda x: '_'.join(x.astype(str)), axis=1)
I am not able to replicate your issue, but my best guess is that if you do the above, the dimensions of the list will remain intact and you will get the string 'nan' instead of empty strings.
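As a possible follow-up (a sketch of my own, assuming the viz is a value count): drop the placeholder entries, both '' from the dropna version and 'nan' from this one, before counting:

q7 = df['Q7'].str.split('_').explode()
q7 = q7[~q7.isin(['', 'nan'])]   # discard empty and 'nan' placeholders
print(q7.value_counts())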
I have a very large dataframe with many columns. I want to check all the columns and remove any row containing any instance of the string 'MU'. Some columns have 'MU#1' or 'MU#2', and they sometimes switch places (e.g. 'MU#1' might be in column 1 at index 0 while 'MU#2' is in column 1 at index 1). Initially I tried removing them with this, but it becomes far too cumbersome when repeated for both strings above:
df_slice = df[(df.phase_2 != 'MU#1') & (df.phase_3 != 'MU#1') & (df.phase_1 != 'MU#1') & (df.phase_4 != 'MU#1') ]
This may work, but I have to repeat this slice a few times with other dataframes and I imagine there is a much simpler route. I also have more columns than what is shown above, but that is just a snippet.
Simply put, all columns need to be checked for 'MU' and the rows with 'MU' need to be removed. Thanks!
You could also try .str.contains() applied across the dataframe. This avoids hardcoding the columns, just in case:
df[df.apply(lambda x: (~x.str.contains('MU', case=True, regex=True)))].dropna()
or
df[~df.stack().str.contains('MU').groupby(level=0).any()]
How it works
Option 1
x.str.contains('MU', case=True, regex=True) — inside df.apply, x is each column in turn, so this flags the cells in that column that contain 'MU' (case sensitive, regex enabled)
~ reverses that, so cells without 'MU' come out True
The resulting dataframe holds NaN where the condition is not met; .dropna() then eliminates the rows containing NaN
Option 2
df.stack()  # stacks the dataframe into a Series with one value per row
df.stack().str.contains('MU')  # boolean: True where a value contains 'MU'
df.stack().str.contains('MU').groupby(level=0).any()  # one flag per original row
~df.stack().str.contains('MU').groupby(level=0).any()  # invert: keep rows without 'MU'
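A tiny runnable demo of option 2 on made-up data (my illustration):

import pandas as pd

df = pd.DataFrame({'phase_1': ['MU#1', 'ok'],
                   'phase_2': ['x', 'y']})

flagged = df.stack().str.contains('MU')      # one boolean per cell
row_has_mu = flagged.groupby(level=0).any()  # collapse to one flag per row
print(df[~row_has_mu])                       # keeps only the second row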
What we do here is check all the phase columns at once with ne and all:
df = df[df[['phase_1','phase_2','phase_3','phase_4']].ne('MU#1').all(axis=1)]
Update
df = df[(~df[['phase_1','phase_2','phase_3','phase_4']].isin(['MU#1','MU#2'])).all(axis=1)]
This works fine for me:
df[~df.stack().str.contains('Any String').groupby(level=0).any()]
It also works when keeping only the rows that do contain a specific string:
df[df.stack().str.contains('Any String').groupby(level=0).any()]
Thanks.
I want to compare two different columns in a dataframe (called station_programming_df). One column ('facility_id') contains integers. A second column ('final_participants_val') contains a comma-separated string of integers. I want to see whether the integer in the first column appears among the values in the second column. If it does, I want to return a 1 in a new column ('master_color').
I have tried various approaches, including pandas' isin, which returns no errors but also does not return the correct matches. I have also attempted to convert the datatypes, with no luck.
station_programming_df['master_color'] = np.where(station_programming_df['facility_id'].isin(station_programming_df['final_participants_val']), 1, 0)
Here is what the dataframe that I am using looks like:
DATA:
facility_id,final_participants_val,master_color
35862,"62469,33894,33749,34847,21656,35396,4624,69571",0
35396,"62469,33894,33749,34847,21656,35396,4624,69571",0
While no error message is returned, I am not finding any matches. The second row should have returned a "1" in the master_color column.
I am wondering if it has to do with how it is interpreting the series (final_participants_val)
Any help would be really appreciated.
Use DataFrame.apply:
station_programming_df['master_color'] = station_programming_df.apply(lambda x: 1 if str(x['facility_id']) in x['final_participants_val'] else 0, axis=1)
print(station_programming_df)
facility_id final_participants_val master_color
0 35862 62469,33894,33749,34847,21656,35396,4624,69571 0
1 35396 62469,33894,33749,34847,21656,35396,4624,69571 1
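One caveat (my addition, not part of the answer): in on the raw string does substring matching, so e.g. 5396 would also match 35396. If exact IDs are required, split the string first:

station_programming_df['master_color'] = station_programming_df.apply(
    lambda x: 1 if str(x['facility_id']) in x['final_participants_val'].split(',') else 0,
    axis=1)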
You can use df.apply and the in operator:
station_programming_df['master_color'] = station_programming_df.apply(lambda x: str(x.facility_id) in x.final_participants_val, axis=1)
facility_id final_participants_val master_color
0 35862 62469,33894,33749,34847,21656,35396,4624,69571 False
1 35396 62469,33894,33749,34847,21656,35396,4624,69571 True
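If you need 1/0 rather than True/False (a small addition of mine), cast the boolean result:

station_programming_df['master_color'] = station_programming_df.apply(
    lambda x: str(x.facility_id) in x.final_participants_val, axis=1).astype(int)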