Split strings into different columns not working correctly - python

I am working with a large dataset with a review column made up of comma-separated strings such as "A,B,C", "A,B*,B", etc.
For example:
import pandas as pd

df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",", expand=True)
df.join(df2)
I want to split that column up into separate columns for each letter and then add those columns to the original data frame. I used df2 = df["review"].str.split(",", expand=True) and df.join(df2) to do that.
However, when I use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there, but B and C show up as well. Also, B and B* are not splitting into two columns.
My dataset is quite large, so I don't know how to properly illustrate this problem. I have tried to provide a small-scale example; however, everything seems to work correctly in that example.
I have looked through the original column with df['review'].unique() and all entries appear to be entered correctly (no missing commas or anything like that), so I am wondering whether something in my approach would stop it from working correctly across all datasets, or whether there is something wrong with my dataset.
Does anyone have any suggestions as to how I should troubleshoot?

when I use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there
IIUC, you wanted to create dummy variables instead?
df2 = df.join(df['review'].str.get_dummies(sep=',').pipe(lambda x: x*[*x]).replace('',float('nan')))
Output:
cat1 review A B B* C D
0 1 A,B,C A B NaN C NaN
1 2 A,B*,B,C A B B* C NaN
2 3 A,C A NaN NaN C NaN
3 4 A,B,C,D A B NaN C D
4 5 A,B,C,A,B A B NaN C NaN
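For readability, here is a sketch that unpacks what each stage of that one-liner does (the intermediate names dummies and labels are mine):
import pandas as pd

df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})

dummies = df['review'].str.get_dummies(sep=',')   # one 0/1 indicator column per distinct label
labels = dummies * list(dummies.columns)          # 1 * 'A' -> 'A', 0 * 'A' -> ''
labels = labels.replace('', float('nan'))         # turn the empty strings into NaN
df2 = df.join(labels)
print(df2)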

Related

How to read in csv correctly with pandas?

I have a csv-file which looks like this:
A B C
1 2 3 4 5 6 7
8 9 1 2 3 4 5
When I read in this file using this code:
df2 = pd.read_csv(r'path\to\file.csv',delimiter=';')
I get a pandas dataframe that consists of three columns named A, B and C.
The first five values of each row in my actual csv file are taken as the index, the last two end up in columns A and B, and column C only contains NaN values.
Instead I would like to get a dataframe with A, B and C as the first three columns and the rest as Unnamed columns. I think it may be due to a formatting issue with my csv-file, but I do not know how to solve this.
Thanks a lot!
Try this:
df2 = pd.read_csv(r'path\to\file.csv', delimiter=' ', names=['A','B','C','D','E','F','G'], skiprows=1, index_col=False)
names supplies labels for all seven columns, skiprows=1 skips the original three-name header row, and index_col=False stops pandas from turning the first column into the index.
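To sanity-check the call without touching the real file, the same read can be exercised on an in-memory copy of the sample (a sketch; io.StringIO stands in for the path):
import io
import pandas as pd

sample = "A B C\n1 2 3 4 5 6 7\n8 9 1 2 3 4 5\n"
df2 = pd.read_csv(io.StringIO(sample), delimiter=' ',
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                  skiprows=1, index_col=False)
print(df2)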

How Can I drop a column if the last row is nan

I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is nan. The reason is that I'm using time series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I do not want columns whose most recent value is nan, as that means the series is defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
df = df.loc[:, df.iloc[-1, :].notna()]
You can use a boolean Series to select the column to drop
df.drop(df.columns[df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in list(temp_df.columns):
    if pd.isna(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
This will work for you.
Basically what I'm doing here is looping over a snapshot of the column labels and checking whether the last entry of each column is NaN, then dropping that column.
list(temp_df.columns)
takes a copy of the column labels up front, so dropping columns inside the loop does not disturb the iteration.
temp_df.drop(col, axis=1)
col is the column label and axis=1 means you want to drop a column rather than a row.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that the condition in the if statement can be swapped for any comparison you wish.
Another method is pandas' isnull() function, which works like this:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)

Pandas overwrite values in column selectively based on condition from another column

I have a dataframe in pandas with four columns. The data consists of strings. Sample:
A B C D
0 2 asicdsada v:cVccv u
1 4 ascccaiiidncll v:cVccv:ccvc u
2 9 sca V:c u
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: u
I want to replace the string 'u' in Col D with the string 'a' if Col C in that row contains the substring 'V' (case sensitive).
Desired outcome:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
I prefer to overwrite the value already in Column D, rather than assign two different values, because I'd like to selectively overwrite some of these values again later, under different conditions.
It seems like this should have a simple solution, but I cannot figure it out, and haven't been able to find a fully applicable solution in other answered questions.
df.ix[1]["D"] = "a"
changes an individual value.
df.ix[:]["C"].str.contains("V")
returns a series of booleans, but I am not sure what to do with it. I have tried many many combinations of .loc, apply, contains, re.search, and for loops, and I get either errors or replace every value in column D. I'm a novice with pandas/python so it's hard to know whether my syntax, methods, or conceptualization of what I even need to do are off (probably all of the above).
As you've already tried, use str.contains to get a boolean Series, and then use .loc to say "change these rows and the D column". For example:
In [5]: df.loc[df["C"].str.contains("V"), "D"] = "a"
In [6]: df
Out[6]:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
(Avoid using .ix -- it's officially deprecated now.)
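Since the question mentions selectively overwriting some of these values again later, the same pattern simply repeats with a new mask. Here is a sketch (the second condition on column B is a made-up example):
import pandas as pd

df = pd.DataFrame({
    "A": [2, 4, 9, 11, 13, 14],
    "B": ["asicdsada", "ascccaiiidncll", "sca", "lkss", "lcoao", "wuduakkk"],
    "C": ["v:cVccv", "v:cVccv:ccvc", "V:c", "v:cv", "v:ccv", "V:ccvcv:"],
    "D": ["u"] * 6,
})

# First pass: rows whose C contains a capital "V" get D = "a"
df.loc[df["C"].str.contains("V"), "D"] = "a"

# Later pass under a different (hypothetical) condition: rows whose B starts with "lk" get D = "b"
df.loc[df["B"].str.startswith("lk"), "D"] = "b"
print(df)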

Can I get a trimmed mean of all columns in a dataframe with nan values?

The problem is that I want to get the trimmed mean of all the columns in a pandas dataframe (i.e. the mean of the values in a given column, excluding the max and the min values). It's likely that some columns will have nan values. Basically, I want to get the exact same functionality as the pandas.DataFrame.mean function, except that it's the trimmed mean.
The obvious solution is to use the scipy tmean function, and iterate over the df columns. So I did:
import scipy as sp

trim_mean = []
for i in data_clean3.columns:
    trim_mean.append(sp.tmean(data_clean3[i]))
This worked great until I encountered nan values, which caused tmean to choke. Worse, when I dropped the nan values from the dataframe, some datasets were wiped out completely because they had a nan value in every column. This means that when I amalgamate all my datasets into a master set, there will be holes in the master set where the trimmed mean should be.
Does anyone know of a way around this? As in, is there a way to get tmean to behave like the standard scipy stats functions and ignore nan values?
(Note that my code is calculating a big number of descriptive statistics on large datasets with limited hardware; highly involved or inefficient workarounds might not be optimal. Hopefully, though, I'm just missing something simple.)
(EDIT: Someone suggested in a comment (that has since vanished?) that I should use the scipy trim_mean function, which allows you to top and tail a specified proportion of the data. This is just to say that this solution won't work for me: my datasets are of unequal sizes, so I cannot specify a fixed proportion of data that will be OK to remove in every case; it must always be just the max and the min values.)
Consider this df:
import numpy as np
import pandas as pd

np.random.seed()
data = np.random.choice((0, 25, 35, 100, np.nan),
                        (1000, 2),
                        p=(.01, .39, .39, .01, .2))
df = pd.DataFrame(data, columns=list('AB'))
Construct the trimmed mean from sums and divide by the relevant normalizer:
(df.sum() - df.min() - df.max()) / (df.notnull().sum() - 2)
A 29.707674
B 30.402228
dtype: float64
For comparison, the untrimmed mean:
df.mean()
A 29.756987
B 30.450617
dtype: float64
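If you need this in more than one place, the same idea wraps into a small helper (a sketch; the name trimmed_mean is mine):
import numpy as np
import pandas as pd

def trimmed_mean(df):
    # Column-wise mean excluding each column's single max and min, ignoring NaNs.
    # sum(), min(), max() and notnull().sum() all skip NaN by default, so NaNs
    # enter neither the numerator nor the count.
    return (df.sum() - df.min() - df.max()) / (df.notnull().sum() - 2)

np.random.seed()
data = np.random.choice((0, 25, 35, 100, np.nan), (1000, 2), p=(.01, .39, .39, .01, .2))
print(trimmed_mean(pd.DataFrame(data, columns=list('AB'))))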
You could use df.mean(skipna=True); see DataFrame.mean.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[5, 1, 'a'], [6, 2, 'b'], [7, 3, 'd'], [np.nan, 4, 'e'], [9, 5, 'f'], [5, 1, 'g']],
                   columns=["A", "B", "C"])
print(df1)
df1 = df1[df1.A != df1.A.max()]  # Remove max values
df1 = df1[df1.A != df1.A.min()]  # Remove min values
print("\nDataframe after removing max and min\n")
print(df1)
print("\nMean of A\n")
print(df1["A"].mean(skipna=True))
Output:
A B C
0 5.0 1 a
1 6.0 2 b
2 7.0 3 d
3 NaN 4 e
4 9.0 5 f
5 5.0 1 g
Dataframe after removing max and min
A B C
1 6.0 2 b
2 7.0 3 d
3 NaN 4 e
Mean of A
6.5

Grouping by many columns in Pandas

I basically have a dataset that looks as follows
Col1 Col2 Col3 Count
A B 1 50
A B 1 50
A C 20 1
A D 17 2
A E 5 70
A E 15 20
Suppose it is called data. I basically do data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum(), which should give me this:
Col1 Col2 Col3 Count
A B 1 100
A C 20 1
A D 17 2
A E 5 70
A E 15 20
However, this returns an empty dataset: it has the columns I want but no rows. The only caveat is that the by parameter is computed dynamically rather than fixed (that's because the columns might change, although Count will always be there).
Any ideas on why this could be failing, and how to fix it?
EDIT: Further searching revealed that pandas' groupby removes rows that have NULL in any of the grouping columns. This is a problem for me because every single column might be NULL. Hence, the actual question is: is there any reasonable way to deal with NULLs and still use groupby?
I would love to be corrected here, but I'm not sure there is a clean way to handle the missing data. As you noted, pandas will simply exclude rows from groupby that contain NaN values in the grouping columns.
You could fill the NaN values with something beyond the range of your data:
import pandas as pd

data = pd.read_csv("c:/Users/simon/Desktop/data.csv")
data.fillna(-999, inplace=True)
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum()
This is messy because it won't add those values to the correct group for the summation, but there's no real way to group by something that's missing.
Another method might be to fill each column separately with some missing value that is appropriate for that variable.
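A minimal sketch of that per-column idea, using the sample data from the question (the sentinel values here are arbitrary placeholders; pick ones outside each column's real range):
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Col1': ['A', 'A', np.nan, 'A'],
    'Col2': ['B', 'B', 'C', np.nan],
    'Col3': [1, 1, 20, 17],
    'Count': [50, 50, 1, 2],
})

# Fill each grouping column with a sentinel suited to its dtype, then group as before.
group_cols = [c for c in data.columns if c != 'Count']
filled = data.fillna({'Col1': 'MISSING', 'Col2': 'MISSING', 'Col3': -999})

new = filled.groupby(by=group_cols, as_index=False, sort=False).sum()
print(new)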
