I have a data frame like this:
a1xbxc | a2xbxc
1      | 2
where a, b, c are attributes that can have different values, e.g. a1, a2 for a, and so on. The way I have this df now is not good for plotting bar charts. I want to have it in a normal way, like this:
factor a | factor b | value
a1       | b        | 1
a2       | b        | 2
How would I go about achieving this? I know I should somehow split each header on "x", find out which factor each part belongs to, and then write it into a new row, but it seems like there must be some easy way to do this in pandas.
Any ideas?
Try this:
df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('x').to_series().apply(tuple))
df.stack([0, 1])
        c
0 a1 b  1
  a2 b  2
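A minimal, self-contained sketch of this approach, using the data from the question (the rename at the end produces the "factor a | factor b | value" layout the question asks for):

```python
import pandas as pd

# Wide frame with composite 'axbxc' column names, as in the question
df = pd.DataFrame({'a1xbxc': [1], 'a2xbxc': [2]})

# Split each header on 'x' into a 3-level MultiIndex, e.g. ('a1', 'b', 'c')
df.columns = pd.MultiIndex.from_tuples(tuple(c.split('x')) for c in df.columns)

# Stack the first two levels (the a and b factors) into the row index,
# leaving 'c' as the remaining column, then move them back out as columns
long = df.stack([0, 1]).reset_index(level=[1, 2])
long.columns = ['factor a', 'factor b', 'value']
print(long)
```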
I am working with a large dataset with a column for reviews, which is comprised of strings such as "A,B,C", "A,B*,B", etc.
For example:
import pandas as pd
df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",",expand = True)
df.join(df2)
I want to split that column up into separate columns for each letter, then add those columns into the original data frame. I used df2 = df["review"].str.split(",",expand = True) and df.join(df2) to do that.
However, when I use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there, but there are also B and C. Also, B and B* are not splitting into two columns.
My dataset is quite large, so I don't know how to properly illustrate this problem; I have tried to provide a small-scale example, but everything seems to work correctly in it.
I have looked through the original column with df['review'].unique() and all entries were entered correctly (no missing commas or anything like that), so I was wondering whether something in my approach would cause it to fail on other datasets, or whether there is something wrong with my dataset itself.
Does anyone have any suggestions as to how I should troubleshoot?
when i use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there
IIUC, you wanted to create dummy variables instead?
df2 = df.join(df['review'].str.get_dummies(sep=',').pipe(lambda x: x*[*x]).replace('',float('nan')))
Output:
cat1 review A B B* C D
0 1 A,B,C A B NaN C NaN
1 2 A,B*,B,C A B B* C NaN
2 3 A,C A NaN NaN C NaN
3 4 A,B,C,D A B NaN C D
4 5 A,B,C,A,B A B NaN C NaN
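The `pipe(lambda x: x*[*x])` step above is terse; a more explicit sketch of the same idea, replacing each 0/1 dummy with the column's own name or NaN, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})

# 0/1 indicator column for every distinct token in 'review'
dummies = df['review'].str.get_dummies(sep=',')

# Replace each 1 with the column's own name and each 0 with NaN --
# the same effect as multiplying each column by its name
for col in dummies.columns:
    dummies[col] = dummies[col].map({1: col, 0: float('nan')})

out = df.join(dummies)
print(out)
```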
OK, I edited this to (hopefully) simplify the problem. I have two variables, each with the following output:
var1: Index(['B4_1', 'B4_2','B4_3', 'B4_4'],
dtype='object', length=4)
var2: Index(['B1_1', 'B1_2','B1_3', 'B1_4'],
dtype='object', length=4)
I am trying to combine them into one variable var that looks like this (order does not matter):
Index(['B4_1', 'B4_2','B4_3', 'B4_4','B1_1', 'B1_2','B1_3', 'B1_4'],
dtype='object', length=8)
Does anyone know how to do this?
For more context to the problem:
Each of these strings (i.e. 'B1_1') is the row index of a dataframe and each corresponds to an x,y coordinate in that dataframe. I am trying to plot all of the coordinates that correspond to these strings in one scatter plot. The dataframe looks like this:
        x  |  y
B1_1    0  |  1
B2_1    1  |  5
B3_1    8  | -2
B4_1    0  |  0
...    ... | ...
B1_4   16  |  0
B2_4   10  | -5
B3_4    0  |  9
B4_4    8  | -2
I am trying to plot all of the points that correspond to the B1 and B4 samples, so I do not want to plot every coordinate pair in the dataframe.
For a single sample I can plot as such:
var1 = df.index[df.index.str.contains('B4')]
ax.scatter(df.loc[var1, 'x'], df.loc[var1, 'y'])
How do I do this for both B1 and B4 together?
I figured it out. You can use Index.union to combine two Index objects.
So in this case, if var1 and var2 are both Index objects:
var = var1.union(var2)
will combine them into one Index object by taking the union.
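A runnable sketch with the two Index objects from the question:

```python
import pandas as pd

var1 = pd.Index(['B4_1', 'B4_2', 'B4_3', 'B4_4'])
var2 = pd.Index(['B1_1', 'B1_2', 'B1_3', 'B1_4'])

# Union of the two Index objects (result is deduplicated and sorted)
var = var1.union(var2)
print(var)
```

Note that union sorts the combined result; since order does not matter here that is fine, but `var1.append(var2)` would preserve the original ordering instead.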
I have the dataframe:
c1 | c2 | c3 | c4
5 | 4 | 9 | 3
How could I perform element-wise division (or some other operation) between c1/c2 and c3/c4,
So that the outcome is:
.5555 | 1.33333
I've tried:
df[['c1', 'c2']].div(df[['c3', 'c4']], axis='index')
But that just resulted in NaNs.
Pretty straightforward, just divide by the values:
df[['c1', 'c2']] / df[['c3', 'c4']].values
Order matters, so make sure the denominator columns are in the correct order. There is no need to recreate the DataFrame.
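A runnable sketch of this, with the one-row frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'c1': [5], 'c2': [4], 'c3': [9], 'c4': [3]})

# .values strips the column labels from the denominator, so the
# division is positional (c1/c3, c2/c4) instead of label-aligned --
# label alignment is why the .div attempt produced all NaNs
res = df[['c1', 'c2']] / df[['c3', 'c4']].values
print(res)
```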
One solution is to drop down to NumPy and create a new dataframe:
res = pd.DataFrame(df[['c1', 'c2']].values / df[['c3', 'c4']].values)
print(res)
0 1
0 0.555556 1.333333
I'm not positive I'm understanding your question correctly, but you can literally just divide the series.
df['c1/c2'] = df['c1'] / df['c2']
See this answer: How to divide two column in a dataframe
EDIT: Okay, I understand what the OP is asking now. Please see the other answers.
I am iterating with a for loop over a table in an HTML file, and in the first iteration I have the following values in the variables name, gene_name_1, value1, gene_name_2, value2.
Each gene_name_X and valueX pair is part of a dictionary, but I don't know how many keys and values are present in each iteration.
My idea was to use a dictionary which looks more or less like this:
d = {'gene_name_1': 2, 'gene_name_2': 5}
But now I realize that the values of the dictionary would change in every loop iteration, so it could look like this in the next loop:
d = {'gene_name_1': 3, 'gene_name_2': 0, 'gene_name_3': 9}
So I am not quite sure if a dictionary is the best data structure here:
What I would like to obtain is a pandas data frame which looks more or less like this.
| gene_name_1 | gene_name_2 | gene_name_3 | ...
organism1 | 2 | 5 | 0 | ...
organism2 | 3 | 0 | 9 | ...
...
Just to clarify: 0 is for those names where the key does not appear.
My problem is that I don't know the column names or the number of columns. I wanted to start with an empty data frame, but I am not sure this is the best way to do it.
How can I start a data frame when I don't know the names or the number of columns?
I hope this was understandable, if I should clarify somehow, please let me know.
I think you need to create a list of dicts and pass it to the DataFrame constructor, then replace NaN with 0 using fillna:
d = {'gene_name_1': 2, 'gene_name_2': 5}
d1 = {'gene_name_1': 3, 'gene_name_2': 0, 'gene_name_3': 9}
# in practice, build this list inside your loop
L = [d, d1]
df = pd.DataFrame(L).fillna(0)
print (df)
gene_name_1 gene_name_2 gene_name_3
0 2 5 0.0
1 3 0 9.0
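In the loop described in the question, that means appending one dict per organism and constructing the frame once at the end. A sketch, where the parsed names and dicts are stand-ins for what the HTML table would yield:

```python
import pandas as pd

# Stand-ins for what each loop iteration over the HTML table produces:
# an organism name and a dict of gene counts
parsed = [('organism1', {'gene_name_1': 2, 'gene_name_2': 5}),
          ('organism2', {'gene_name_1': 3, 'gene_name_2': 0, 'gene_name_3': 9})]

rows, index = [], []
for name, counts in parsed:
    rows.append(counts)
    index.append(name)

# Build the frame once at the end; genes missing from a dict become NaN -> 0
df = pd.DataFrame(rows, index=index).fillna(0).astype(int)
print(df)
```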
I'd like to take a dataframe and visualize how useful each column is in a k-neighbors analysis, so I was wondering if there is a way to loop through dropping columns and analyzing the dataframe, producing an accuracy for every single combination of columns. I'm not sure whether pandas has functions I'm not aware of that could make this easier, or how to loop through the dataframe to produce every combination of the original dataframe. If I have not explained it well, I will try to create a diagram.
a | b | c | labels
1 | 2 | 3 |   0
5 | 6 | 7 |   1
The dataframe above would produce something like this after being run through the splitting and k-neighbors function:
a & b = 43%
a & c = 56%
b & c = 78%
a & b & c = 95%
import itertools

min_size = 2
max_size = df.shape[1]

column_subsets = itertools.chain(
    *map(lambda x: itertools.combinations(df.columns, x),
         range(min_size, max_size + 1)))

for column_subset in column_subsets:
    foo(df[list(column_subset)])
where df is your dataframe and foo is whatever kNN analysis you're doing. Although you said "all combinations", I set min_size to 2 since your example only shows sizes >= 2. And these are more precisely referred to as "subsets" rather than "combinations".
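As a concrete illustration of the subset generation alone (a plain list of names stands in for df.columns, and the kNN fit/score step is omitted):

```python
import itertools

# Stand-in for df.columns; the real loop would call the kNN
# fit/score on df[list(subset)] for each subset
columns = ['a', 'b', 'c']
min_size = 2

subsets = list(itertools.chain.from_iterable(
    itertools.combinations(columns, n)
    for n in range(min_size, len(columns) + 1)))
print(subsets)
```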