I am having some trouble with my Python work.
My steps are:
1) add the list to an ordinary DataFrame
2) delete the column that holds the minimum value of the list
My list is called 'each_c' and my ordinary DataFrame is called 'df_col'.
I want it to become like this:
hope someone can help me, thanks!
This is clearly described in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_col.drop(columns=[3])
Convert each_c to a Series, append it with DataFrame.append, then get the label of the minimal value with Series.idxmin and pass it to drop. This removes only the first minimal column:
s = pd.Series(each_c)
df = df_col.append(s, ignore_index=True).drop(s.idxmin(), axis=1)
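Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same idea with pd.concat follows, assuming default integer column labels; the sample values and the names appended and result are illustrative, not taken from the question:

```python
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, 0.120]          # illustrative values
df_col = pd.DataFrame(np.random.random((3, 4)))   # columns 0..3

s = pd.Series(each_c, index=df_col.columns)
# turn the Series into a one-row frame and stack it under df_col
appended = pd.concat([df_col, s.to_frame().T], ignore_index=True)
# drop the column holding the (first) minimum of each_c
result = appended.drop(columns=s.idxmin())
print(result)
```

The Series is given the frame's column labels up front so that the appended row lines up with the existing columns.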
If you need to remove all columns that hold the minimum when there are several:
each_c = [-0.025,0.008,-0.308,-0.308]
s = pd.Series(each_c)
df_col = pd.DataFrame(np.random.random((10,4)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
0 1
0 0.602312 0.641220
1 0.586233 0.634599
2 0.294047 0.339367
3 0.246470 0.546825
4 0.093003 0.375238
5 0.765421 0.605539
6 0.962440 0.990816
7 0.810420 0.943681
8 0.307483 0.170656
9 0.851870 0.460508
10 -0.025000 0.008000
EDIT: If the solution raises the error:
IndexError: Boolean index has wrong length:
it means the columns do not have the default range names 0, 1, 2, 3. A possible solution is to set the index values of the Series with rename:
each_c = [-0.025,0.008,-0.308,-0.308]
df_col = pd.DataFrame(np.random.random((10,4)), columns=list('abcd'))
s = pd.Series(each_c).rename(dict(enumerate(df_col.columns)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
a b
0 0.321498 0.327755
1 0.514713 0.575802
2 0.866681 0.301447
3 0.068989 0.140084
4 0.069780 0.979451
5 0.629282 0.606209
6 0.032888 0.204491
7 0.248555 0.338516
8 0.270608 0.731319
9 0.732802 0.911920
10 -0.025000 0.008000
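On pandas 2.0 or later, where DataFrame.append is no longer available, the same selection can be sketched with pd.concat. This version assumes the named columns a to d from the example and passes the column labels straight to the Series constructor instead of using rename; the names appended and kept are illustrative:

```python
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
df_col = pd.DataFrame(np.random.random((3, 4)), columns=list("abcd"))

# index the Series by the column labels so the boolean mask aligns by name
s = pd.Series(each_c, index=df_col.columns)
appended = pd.concat([df_col, s.to_frame().T], ignore_index=True)
# keep only the columns whose value in each_c is not the minimum
kept = appended.loc[:, s.ne(s.min())]
print(kept)
```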
I have a data-frame with column1 containing string values and column2 containing lists of string values.
I want to iterate through column1 and concatenate column1 values with their corresponding row values into a new data-frame.
Say, my input is
`dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}`
after the operation my data will look like this
dfd2 = {'TRAINSET':['101a1','101x1','101b2', '102a1','102b3','102b2','103d3', '103g5','103x2','104x1','104b2', '104a1']}
What I tried is:
dg = pd.concat([g['TRAINSET'].map(g['unique']).apply(pd.Series)], axis = 1)
but I get KeyError: 'TRAINSET', as this is probably not the proper syntax.
Also, I would like to remove the NaN values in the lists.
One option is a list comprehension that flattens the lists, joins the values with + and passes the result to the DataFrame constructor:
#if necessary
#df = df.reset_index()
#flatten values with filter out missing values
L = [(str(a) + x) for a, b in df[['TRAINSET','unique']].values for x in b if pd.notna(x)]
df1 = pd.DataFrame({'TRAINSET': L})
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Or use DataFrame.explode (pandas 0.25+), remove missing values with DataFrame.dropna, create a default index, and join the columns with +, using Series.to_frame to get a one-column DataFrame:
df = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)
df1 = (df['TRAINSET'].astype(str) + df['unique']).to_frame('TRAINSET')
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Starting from your original data, you can do the following using explode (new in pandas 0.25+) and agg:
Input:
dfd = {'TRAINSET':['101','102','103', '104'],
'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
Solution:
df = pd.DataFrame(dfd)
df.explode('unique').astype(str).agg(''.join,1).to_frame('TRAINSET').to_dict('list')
{'TRAINSET': ['101a1',
'101x1',
'101b2',
'102a1',
'102b3',
'102b2',
'103d3',
'103g5',
'103x2',
'104x1',
'104b2',
'104a1']}
Another solution, just to give you some choice...
import pandas as pd
_dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
dfd = pd.DataFrame.from_dict(_dfd)
dfd.set_index("TRAINSET", inplace=True)
print(dfd)
dfd2 = dfd.reset_index()
def refactor(row):
    # keep the list as a list; str() would make the loop iterate over characters
    key, values = str(row["TRAINSET"]), row["unique"]
    return [key + v for v in values if pd.notna(v)]
# one list per row, then one row per list element
dfd2 = dfd2.apply(refactor, axis=1).explode().to_frame("TRAINSET").reset_index(drop=True)
print(dfd2)
I have a big dataframe with many duplicates in it. I want to keep the first and last entry of each duplicate but drop every duplicate in between.
I've already tried to get this done by using df.drop_duplicates with the parameters 'first' and 'last' to get two dataframes and then merge them again to one df so I have the first and last entry, but that didn't work.
df_first = df
df_last = df
df_first['Path'].drop_duplicates(keep='first', inplace=True)
df_last['Path'].drop_duplicates(keep='last', inplace=True)
Thanks for your help in advance!
Use GroupBy.nth, which avoids duplicated rows when a group has length 1:
df = pd.DataFrame({
'a':[5,3,6,9,2,4],
'Path':list('aaabbc')
})
print(df)
a Path
0 5 a
1 3 a
2 6 a
3 9 b
4 2 b
5 4 c
df = df.groupby('Path').nth([0, -1])
print (df)
a
Path
a 5
a 6
b 9
b 2
c 4
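The output above has the group key as the index. In pandas 2.0 and later, nth acts as a filter and returns the selected rows with their original index and all columns, so the printed result can look different. As an alternative that does not depend on that behaviour, here is a sketch built on DataFrame.duplicated; the names not_first, not_last and first_last are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 3, 6, 9, 2, 4], "Path": list("aaabbc")})

# a row is "in between" when it is neither the first nor the last of its Path
not_first = df.duplicated("Path", keep="first")
not_last = df.duplicated("Path", keep="last")
first_last = df[~(not_first & not_last)]
print(first_last)
```

A Path that occurs only once is marked False by both masks, so it is kept exactly once.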
Using GroupBy.nth, adapted from the previous solution, to keep the second entry of each duplicated group:
def keep_second_dup(duplicate, column_name):
    # count how many rows share each value of column_name
    counts = duplicate.groupby(column_name)[column_name].transform('size')
    duplicate = duplicate.assign(Count=counts)
    second_duplicate = duplicate[duplicate['Count'] > 1]
    residual = duplicate[duplicate['Count'] == 1]
    # as_index=False makes nth return the matching rows with all columns
    sec = second_duplicate.groupby(column_name, as_index=False).nth(1)
    final_data = pd.concat([sec, residual])
    return final_data.drop('Count', axis=1)
I am working with Python in Bigquery and have a large dataframe df (circa 7m rows). I also have a list lst that holds some dates (say all days in a given month).
I am trying to create an additional column "random_day" in df with a random value from lst in each row.
I tried running a loop and apply function but being quite a large dataset it is proving challenging.
My first attempt used a loop:
df["rand_day"] = ""
for i in a["row_nr"]:
    rand_day = sample(day_list,1)[0]
    df.loc[i,"rand_day"] = rand_day
And the apply solution, first defining my function and then calling it:
def random_day():
    rand_day = sample(day_list,1)[0]
    return rand_day
df["rand_day"] = df.apply(lambda row: random_day(), axis=1)
Any tips on this?
Thank you
Use numpy.random.choice and if necessary convert dates by to_datetime:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
})
day_list = pd.to_datetime(['2015-01-02','2016-05-05','2015-08-09'])
#alternative
#day_list = pd.DatetimeIndex(['2015-01-02','2016-05-05','2015-08-09'])
df["rand_day"] = np.random.choice(day_list, size=len(df))
print (df)
A B rand_day
0 a 4 2016-05-05
1 b 5 2016-05-05
2 c 4 2015-08-09
3 d 5 2015-01-02
4 e 5 2015-08-09
5 f 4 2015-08-09
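For reproducible draws, the same vectorized idea can be sketched with NumPy's Generator API. The seed and the names rng and positions are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("abcdef"), "B": [4, 5, 4, 5, 5, 4]})
day_list = pd.to_datetime(["2015-01-02", "2016-05-05", "2015-08-09"])

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
# draw one position per row, then look the dates up in a single vectorized step
positions = rng.integers(0, len(day_list), size=len(df))
df["rand_day"] = day_list[positions]
print(df)
```

Because there is no Python-level loop over rows, the cost grows with a single array operation rather than with one function call per row, which matters at around 7m rows.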
Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), axis=1, inplace=True)
In [49]: df
Out[49]:
a b
0 1 2
1 4 3
2 8 9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
a b
0 1 2
1 4 3
2 8 9
PS: please be aware that the most idiomatic pandas way to do that was already proposed by @Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Where the first positional argument is items=[]
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
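The regex variant mentioned above can be sketched as follows. The one-row frame is made up for illustration and reuses the column names from the example:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=["a_1", "a_2", "b_1", "b_2"])

# regex is matched with re.search, so anchor it to match only a prefix
b_cols = df.filter(regex=r"^b_")
print(b_cols.columns.tolist())
```

Unlike like='b_', which matches the substring anywhere in the label, the anchored pattern only keeps labels that start with b_.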
There are multiple solutions.
df = df[['a','b']] #1
df = df[list('ab')] #2
df = df.loc[:,df.columns.isin(['a','b'])] #3
df = pd.DataFrame(data=df.eval('a,b').T,columns=['a','b']) #4 PS:I do not recommend this method , but still a way to achieve this
Hey, what you are looking for is:
df = df[["a","b"]]
You will receive a DataFrame which only contains the columns a and b.
If you want to keep more columns than you're dropping, put a "~" before the .isin statement to select every column except the ones you list:
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to keep, let's say 20 or 30, you can put them in a list as well and drop everything else. Make sure that you also specify the axis value.
keep_list = ["a","b"]
df = df.drop(df.columns.difference(keep_list), axis=1)
I am attempting to add a Series to an empty DataFrame and cannot find an answer
either in the docs or in other questions. Since you can append two DataFrames by row
or by column, it would seem there must be an "axis marker" missing from a Series. Can
anyone explain why this does not work?
import pandas as pd
df1 = pd.DataFrame()
s1 = pd.Series(['a',5,6])
df1 = pd.concat([df1,s1],axis = 1)
#go run some process return s2, s3, sn ...
s2 = pd.Series(['b',8,9])
df1 = pd.concat([df1,s2],axis = 1)
s3 = pd.Series(['c',10,11])
df1 = pd.concat([df1,s3],axis = 1)
If my example above is some how misleading perhaps using the example from the docs will help.
Quoting: Appending rows to a DataFrame.
While not especially efficient (since a new object must be created), you can append a
single row to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above. End Quote.
The example from the docs appends "s", which is a row from a DataFrame; "s1" is a Series,
and attempting to append "s1" produces an error. My question is: why will appending "s1" not work? The assumption behind the question is that a DataFrame must contain axis information for two axes, whereas a Series must contain information for only one axis.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.xs(3); #row with index label 3 (the fourth row)
s1 = pd.Series([np.random.randn(4)]); #new Series of equal len
df= df.append(s, ignore_index=True)
Result
0 1
0 a b
1 5 8
2 6 9
Desired
0 1 2
0 a 5 6
1 b 8 9
You were close; just transpose the result from concat:
In [14]: s1
Out[14]:
0 a
1 5
2 6
dtype: object
In [15]: s2
Out[15]:
0 b
1 8
2 9
dtype: object
In [16]: pd.concat([s1, s2], axis=1).T
Out[16]:
0 1 2
0 a 5 6
1 b 8 9
[2 rows x 3 columns]
You also don't need to create the empty DataFrame.
The best way is to use DataFrame to construct a DF from a sequence of Series, rather than using concat:
import pandas as pd
s1 = pd.Series(['a',5,6])
s2 = pd.Series(['b',8,9])
pd.DataFrame([s1, s2])
Output:
In [4]: pd.DataFrame([s1, s2])
Out[4]:
0 1 2
0 a 5 6
1 b 8 9
A method of accomplishing the same objective as appending a Series to a DataFrame
is to just convert the data to an array of lists and append the array(s) to the DataFrame.
# data as an array of lists
def get_example(idx):
    list1 = (idx+1,idx+2 ,chr(idx + 97))
    data = [list1]
    return(data)
df1 = pd.DataFrame()
for idx in range(4):
    data = get_example(idx)
    df1= df1.append(data, ignore_index = True)
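Since DataFrame.append was removed in pandas 2.0, and growing a frame row by row copies it on every step, one alternative is to collect the rows in a plain list and build the DataFrame once. This is a sketch that mirrors the get_example helper above, simplified to return a single tuple; the name rows is illustrative:

```python
import pandas as pd

def get_example(idx):
    # one row of sample data, mirroring the helper above
    return (idx + 1, idx + 2, chr(idx + 97))

# gather all rows first, then construct the frame in a single call
rows = [get_example(idx) for idx in range(4)]
df1 = pd.DataFrame(rows)
print(df1)
```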