Trying to .loc a column with a list in pandas - index not found - Python

I have a fixed list, e.g. sort_by = ['a','b','c','d','e','f']. My dataframe looks like this (I have made Column1 my index):
Column1  Column2  ...
d        1
d        2
b        3
a        4
a        5
b        6
c        7
I want to .loc with the sort_by list to sort the rows; however, sometimes not all values of sort_by are in Column1, which results in an "index not found" error. How do I get it to "try" to the best of its ability?
s.set_index('mitre_attack_tactic', inplace=True)
print(s.loc[sort_by])  --> doesn't work
print(s.loc[['a', 'b', 'c', 'd']])  --> does work, however Column1 could also contain e, f, g

Let us try pd.Categorical
out = df.iloc[pd.Categorical(df.Column1,['a','b','c','d']).argsort()]
Out[48]:
Column1 Column2
3 a 4
4 a 5
2 b 3
5 b 6
6 c 7
0 d 1
1 d 2
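The categories argument does not have to match the column exactly; passing the full sort_by list works even though 'e' and 'f' never occur. A minimal sketch, assuming the question's data with Column1 as a regular column:
import pandas as pd

df = pd.DataFrame({'Column1': list('ddbaabc'),
                   'Column2': [1, 2, 3, 4, 5, 6, 7]})
sort_by = ['a', 'b', 'c', 'd', 'e', 'f']

# categories that never appear in Column1 ('e', 'f') are harmless;
# argsort simply never produces their codes
out = df.iloc[pd.Categorical(df.Column1, sort_by).argsort()]
print(out)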

You can use the key parameter of df.sort_values. The idea is to create a value-to-index dictionary from the sort_by list, map the column through that dictionary, and sort by the resulting indices.
key = {v:k for k, v in enumerate(sort_by)}
df = df.sort_values('Column1', key=lambda col: col.map(key))
print(df)
Column1 Column2
3 a 4
4 a 5
2 b 3
5 b 6
6 c 7
0 d 1
1 d 2
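This also degrades gracefully when Column1 contains a value that is not in sort_by: col.map(key) turns it into NaN, and sort_values places NaN keys last by default (na_position='last'). A small sketch, where the extra 'g' row is a hypothetical value outside sort_by:
import pandas as pd

sort_by = ['a', 'b', 'c', 'd', 'e', 'f']
key = {v: k for k, v in enumerate(sort_by)}

df = pd.DataFrame({'Column1': list('ddbaabcg'), 'Column2': range(1, 9)})
# the unmapped 'g' row gets a NaN sort key and ends up at the bottom
# (the key= argument of sort_values requires pandas >= 1.1)
print(df.sort_values('Column1', key=lambda col: col.map(key)))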

This page helps:
If you create your sort_by as a categorical:
sort_by = pd.api.types.CategoricalDtype(["a","b","c","d","e","f"], ordered=True)
Then change your column to a categorical:
s['Column1'] = s['Column1'].astype(sort_by)
You can then sort it:
s.sort_values('Column1')
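Put together, a minimal end-to-end sketch (assuming Column1 is still a regular column rather than the index):
import pandas as pd

s = pd.DataFrame({'Column1': list('ddbaabc'), 'Column2': [1, 2, 3, 4, 5, 6, 7]})
sort_by = pd.api.types.CategoricalDtype(["a", "b", "c", "d", "e", "f"], ordered=True)

s['Column1'] = s['Column1'].astype(sort_by)
# any value not listed in the dtype's categories would become NaN here
# and be placed last by sort_values (na_position='last' is the default)
print(s.sort_values('Column1'))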

Use Index.intersection:
df.loc[pd.Index(sort_by).intersection(df.index)]
Column2
a 4
a 5
b 3
b 6
c 7
d 1
d 2
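A quick reproduction with the question's frame, where Column1 is the index; labels from sort_by that are missing from the index (here 'e' and 'f') are dropped by the intersection, so .loc no longer raises:
import pandas as pd

df = (pd.DataFrame({'Column1': list('ddbaabc'), 'Column2': [1, 2, 3, 4, 5, 6, 7]})
        .set_index('Column1'))
sort_by = ['a', 'b', 'c', 'd', 'e', 'f']

print(df.loc[pd.Index(sort_by).intersection(df.index)])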

Related

Merging Pandas DataFrames, averaging values where both have values

I have two Dataframes.
print(df1)
key value
0 A 2
1 B 3
2 C 2
3 D 3
print(df2)
key value
0 B 3
1 D 1
2 E 1
3 F 3
What I want is for it to do an outer merge on key and pick whichever value is not NaN.
Which one it chooses if both are ints (or floats) is not that important; the mean would be a nice touch, though.
print(df3)
key value
0 A 2
1 B 3
3 C 2
4 D 2
5 E 1
6 F 3
I tried:
df3 = df1.merge(df2, on='key', how='outer')
but it generates two new value columns. I could just do my calculations afterwards, but I am sure there is an easier solution that I just could not find.
Thanks for your help.
This works for me: the duplicates are dropped in the order of dataframe entry, so the dupes from df1 are dropped and the ones from df2 are kept (keep='last'). If any rows contain NaN values, we can drop them with .dropna():
dfs = pd.concat([df1,df2]).drop_duplicates(subset=['key'],keep='last').dropna(how='any')
key value
0 A 2
2 C 2
0 B 3
1 D 1
2 E 1
3 F 3
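If you actually want the mean whenever a key exists in both frames (so D becomes 2, as in the desired df3), a short groupby sketch:
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [2, 3, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [3, 1, 1, 3]})

# outer-combine the frames and average wherever a key appears in both
df3 = pd.concat([df1, df2]).groupby('key', as_index=False)['value'].mean()
print(df3)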

How to select a row and the row above it based on a specific string in pandas?

This is an example of a bigger dataframe:
column1
0 a
1 b
2 x
3 c
4 b
5 x
6 d
7 x
8 e
9 e
In this dataframe, I would like to select every row that has 'x' in it and also the row exactly above each of these. Then I want to create another dataframe with these rows.
The final dataframe should be like this:
column1
1 b
2 x
4 b
5 x
6 d
7 x
Could anyone help me?
Thanks
You can use shift:
print (df.loc[df["column1"].eq("x")|df["column1"].eq("x").shift(-1)])
column1
1 b
2 x
4 b
5 x
6 d
7 x
use shift()
df = pd.DataFrame({'column1':['a','b','x','c','b','x','d','x','e','e']})
df[(df['column1'] == 'x') | (df['column1'].shift(-1) == 'x')]
produces
column1
1 b
2 x
4 b
5 x
6 d
7 x
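The same mask idea generalizes if you ever need the n rows above each 'x' rather than just one; a hedged sketch (n is a hypothetical parameter, not part of the original question):
import pandas as pd

df = pd.DataFrame({'column1': ['a','b','x','c','b','x','d','x','e','e']})

n = 1  # number of rows to keep above each 'x'
mask = df['column1'].eq('x')
for k in range(1, n + 1):
    mask |= df['column1'].eq('x').shift(-k, fill_value=False)
print(df[mask])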

How do I merge the contents of columns in a dataframe in Python?

I'm new to Python and dataframes, so I am wondering if someone knows how I could accomplish the following. I have a dataframe with many columns, some of which share a beginning and have an underscore followed by a number (bird_1, bird_2, bird_3). I want to essentially merge all of the columns that share a beginning into single columns containing all the values from the constituent columns. Then I'd like to run df[column].value_counts() for each.
(The initial and final dataframes were shown as images in the original post.)
For df['bird'].value_counts(), I would get a count of 1 for A-L.
For df['cat'].value_counts(), I would get a count of 3 for A, 4 for B, and 1 for C.
The ultimate goal is to get a count of unique values for each column type (bird, cat, dog, etc.)
You can do:
df.columns = [col.split("_")[0] for col in df.columns]
df = df.unstack().reset_index(1, drop=True).reset_index()
df["id"] = df.groupby("index").cumcount()
df = df.pivot(index="id", values=0, columns="index")
Outputs:
index bird cat
id
0 A A
1 B A
2 C A
3 D B
4 E B
5 F B
6 G B
7 H C
8 I NaN
9 J NaN
10 K NaN
11 L NaN
From there to get counts of all possible values:
df.T.stack().reset_index(1, drop=True).reset_index().groupby(["index", 0]).size()
Outputs:
index 0
bird A 1
B 1
C 1
D 1
E 1
F 1
G 1
H 1
I 1
J 1
K 1
L 1
cat A 3
B 4
C 1
dtype: int64
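If the only goal is the per-prefix value counts (not the reshaped frame itself), a shorter alternative sketch using melt, starting from the original frame with the bird_1, cat_1, ... columns; the input frame and the 'prefix' column name are hypothetical illustrations:
import pandas as pd

# hypothetical input with the prefixed columns described in the question
df = pd.DataFrame({'bird_1': ['A', 'B', 'C', 'D'],
                   'bird_2': ['E', 'F', 'G', 'H'],
                   'cat_1':  ['A', 'A', 'B', 'B'],
                   'cat_2':  ['A', 'B', 'B', 'C']})

melted = df.melt()                                    # columns: 'variable', 'value'
melted['prefix'] = melted['variable'].str.split('_').str[0]
print(melted.groupby('prefix')['value'].value_counts())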

How to convert a dictionary of dataframes (stored as values) to one CSV?

Hi, I need some advice on how to convert a dictionary of dataframes to one CSV.
Below is my structure:
dic_dataframe = {"Key value 1": DF1, "Key value 2": DF2}
In the above, DF1 looks like:
index A B C
0 a b c
1 x y z
2 1 2 3
and DF2 has the same number of columns with the same column names, just different rows:
index A B C
0 w x y
1 3 4 5
I am expecting a CSV file which looks as follows:
index Key_Values A B C
0 Key value 1 a b c
1 Key value 1 x y z
2 Key value 1 1 2 3
3 Key value 2 w x y
4 Key value 2 3 4 5
Any help will be greatly appreciated, as I have tried many things but can't get this to work.
Use concat with the dictionary of DataFrames, remove the second level of the MultiIndex with DataFrame.reset_index, then set the new column name with DataFrame.rename_axis, and finally use reset_index to convert the index to a column:
dic_dataframe = {"Key value 1": DF1, "Key value 2": DF2}
df = (pd.concat(dic_dataframe)
        .reset_index(level=1, drop=True)
        .rename_axis('Key_Values')
        .reset_index())
print (df)
Key_Values A B C
0 Key value 1 a b c
1 Key value 1 x y z
2 Key value 1 1 2 3
3 Key value 2 w x y
4 Key value 2 3 4 5
Lastly, write to CSV with DataFrame.to_csv:
df.to_csv(file, index=False)
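A slightly shorter variant of the same idea, reusing the dic_dataframe defined above, names the keys level directly in concat and drops the old row index with droplevel (both available in any reasonably recent pandas):
df = (pd.concat(dic_dataframe, names=['Key_Values'])
        .droplevel(1)       # drop the per-frame row index
        .reset_index())     # turn the keys level into the Key_Values column
df.to_csv(file, index=False)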

Duplicate rows of low occurrence in a pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is < 3 until that count reaches 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated twice at the end. This is only an example; the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but do not know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append is available in pandas >= 0.23.0; remove it if you are using a lower version.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
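DataFrame.append was removed in pandas 2.0, so on newer versions the same idea can be written with Index.repeat and concat; a sketch that also keeps any extra value columns automatically:
import pandas as pd

df = pd.DataFrame({'Type': ['a', 'a', 'a', 'b', 'c', 'c', 'c'],
                   'Val':  [1, 2, 3, 1, 3, 2, 1]})

counts = df['Type'].value_counts()
repeats = df['Type'].map(3 - counts[counts < 3]).fillna(0).astype(int)

# duplicate each under-represented row as many times as needed, then append the copies
df1 = pd.concat([df, df.loc[df.index.repeat(repeats)]], ignore_index=True)
print(df1)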
