Creating dataframe from another dataframe and list - python

I have a list of names like this:
names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
and I have a dataframe like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Barsa
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
I want to create a dataframe that has all the names in the names list together with the corresponding numbers from this dataframe grouped; for names that do not have corresponding numbers there should be an asterisk, like below:
Names Number
Josh 1
Jon 2,9
Adam 3
Barsa 4,6,7
Fekse 5
Bravo *
Talyo 8
Zidane 10
Do we have any built-in functions to get this done?

You can use GroupBy with str.join, then reindex with your names list:
res = df.groupby('Names')['Number'].apply(lambda x: ','.join(map(str, x))).to_frame()\
.reindex(names).fillna('*').reset_index()
print(res)
Names Number
0 Josh 1
1 Jon 2,9
2 Adam 3
3 Barsa 4,6,7
4 Fekse 5
5 Bravo *
6 Talyo 8
7 Zidane 10
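
For reference, here is a minimal, self-contained sketch of the same approach; the DataFrame literal below is an assumption rebuilt from the question's data:
import pandas as pd

names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
df = pd.DataFrame({'Number': range(1, 11),
                   'Names': ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse',
                             'Barsa', 'Barsa', 'Talyo', 'Jon', 'Zidane']})

# Join the numbers per name, then reindex with the full names list so
# missing names appear and can be filled with '*'
res = (df.groupby('Names')['Number']
         .apply(lambda x: ','.join(map(str, x)))
         .to_frame()
         .reindex(names)
         .fillna('*')
         .reset_index())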

Related

Disproportionate stratified sampling in Pandas

How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0
You can use a groupby on the Name column and apply sample:
df.groupby('Name', as_index=False).apply(lambda x: x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
You can shuffle all samples using, for example, the numpy function random.permutation, then group by Name and take the first N rows from each group:
df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
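On newer pandas versions (1.1 and later), GroupBy also provides a sample method, so this can be a one-liner; a small sketch assuming the same df as above:
# Requires pandas >= 1.1: draws one random row per Name group
df.groupby('Name').sample(n=1)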
You can achieve that using unique:
df['Name'].unique()
Shuffle the dataframe:
df.sample(frac=1)
And then drop duplicated rows:
df.drop_duplicates(subset=['Name'])
df.drop_duplicates(subset='Name')
Distance Name Time Order
1 16 John 5 0
0 23 Kate 3 0
2 32 Peter 2 0
This should help, but it is not a random choice; it keeps the first row per name.
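Combining the two steps above gives a random row per name; a small sketch, with sort_index only there to restore the original row order:
# Shuffle, keep one (now random) row per Name, then restore index order
df.sample(frac=1).drop_duplicates(subset='Name').sort_index()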
How about using random? Like this:
Import your provided data,
df = pd.read_csv('random_data.csv', header=0)
which looks like this:
Distance Name Time Order
1 16 John 5 0
4 3 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name,
import random
colname = df.columns[random.randint(1, 3)]
and below it happened to select 'Name':
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course, I could have condensed this to:
print(df[df.columns[random.randint(1, 3)]])

How can we fill the empty values in the column?

I have table A with 3 columns. The column (val) has some empty values. The question is: is there any way to fill the empty values based on the previous value using Python? For example, Alex and John take value 20 and Sam takes value 100.
A = [ id name val
1 jack 10
2 mec 20
3 alex
4 john
5 sam 250
6 tom 100
7 sam
8 hellen 300]
You can read the data in as a pandas DataFrame and use the built-in function fillna() to solve your problem. For example,
df = # your data
df.fillna(method='pad')
Would return a dataframe like,
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to the pandas fillna documentation for more information.
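Note that fillna(method='pad') is a forward fill; recent pandas versions recommend the dedicated ffill method instead (pandas 2.x deprecates the method argument of fillna). A minimal equivalent:
# Forward-fill empty values from the previous row
df = df.ffill()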

Comparing string entries in two Pandas series

I have two pandas Series and would simply like to compare their string values, returning the strings (and maybe the indices too) of the values they have in common, e.g. Hannah, Frank and Ernie in the example below:
print(x)
print(y)
0 Anne
1 Beth
2 Caroline
3 David
4 Ernie
5 Frank
6 George
7 Hannah
Name: 0, dtype: object
1 Hannah
2 Frank
3 Ernie
4 NaN
5 NaN
6 NaN
7 NaN
Doing
x == y
throws a
ValueError: Can only compare identically-labeled Series objects
as does
x.sort_index(axis=0) == y.sort_index(axis=0)
and
x.reindex_like(y) > y
does something, but not the right thing!
If you need only the common values, you can convert the first Series to a set and use intersection:
a = set(x).intersection(y)
print (a)
{'Hannah', 'Frank', 'Ernie'}
And for the indices you need merge (an inner join by default) with reset_index to convert the indices to columns:
df = pd.merge(x.rename('a').reset_index(), y.rename('a').reset_index(), on='a')
print (df)
index_x a index_y
0 4 Ernie 3
1 5 Frank 2
2 7 Hannah 1
Detail:
print (x.rename('a').reset_index())
index a
0 0 Anne
1 1 Beth
2 2 Caroline
3 3 David
4 4 Ernie
5 5 Frank
6 6 George
7 7 Hannah
print (y.rename('a').reset_index())
index a
0 1 Hannah
1 2 Frank
2 3 Ernie
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
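If you only need the common values and their positions in x, Series.isin is another option; a small sketch using the x and y from the question:
# Keep the entries of x whose value also appears somewhere in y
common = x[x.isin(y)]
# common now holds Ernie (index 4), Frank (index 5) and Hannah (index 7)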

Merging pandas columns (many-to-one)

I have two dataframes:
df1
ID
1
2
3
4
5
6
7
8
9
10
df2:
Name Count
raj 2
dinesh 3
sachin 3
glen 2
Now I want to create a third dataframe based on df1, with a second column "Owner" inserted so that 2 rows are assigned to raj, 3 to dinesh, 3 to sachin and 2 to glen. The third dataframe will look like this:
df3:
ID Owner
1 raj
2 raj
3 dinesh
4 dinesh
5 dinesh
6 sachin
7 sachin
8 sachin
9 glen
10 glen
I'd highly appreciate all your help.
It seems you need numpy.repeat, but it is necessary that the sum of all Count values equals the length of df1:
import numpy as np
df1['Owner'] = np.repeat(df2['Name'].values, df2['Count'].values)
print (df1)
ID Owner
0 1 raj
1 2 raj
2 3 dinesh
3 4 dinesh
4 5 dinesh
5 6 sachin
6 7 sachin
7 8 sachin
8 9 glen
9 10 glen
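Since np.repeat only lines up when the counts add up to the number of rows in df1, it may be worth checking that before assigning; a small sketch using the df1 and df2 from the question (and numpy imported as above):
# The repeated names must cover every ID exactly once
assert df2['Count'].sum() == len(df1), 'Count total does not match df1 length'
df1['Owner'] = np.repeat(df2['Name'].values, df2['Count'].values)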

How to correctly sort a multi-indexed pandas DataFrame

I have a multi-indexed pandas dataframe that looks like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
5 1 1.143599
2 1.151358
3 1.272172
10 1 1.765615
2 1.779330
3 1.752246
20 1 1.685807
2 1.688354
3 1.614013
..... ....
0 4 2.111466
5 1.933589
6 1.336527
5 4 2.006936
5 2.040884
6 1.430818
10 4 1.398334
5 1.594028
6 1.684037
20 4 1.529750
5 1.721385
6 1.608393
(Note that I've only posted one antibody; there are many analogous entries under the Antibody index, but they all have the same format.) Despite leaving out the entries in the middle for the sake of space, you can see that I have 6 experimental repeats, but they are not organized properly. My question is: how would I get the DataFrame to aggregate all the repeats? So the output would look something like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.60839
..... ....
Thanks in advance
I think you need sort_index:
df = df.sort_index(level=[0,1,2])
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
Or you can omit the level parameter entirely:
df = df.sort_index()
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
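Since the index levels are named (Antibody, Time, Repeats), you can also sort by level name, which reads a bit more clearly; a small sketch:
# Sort by the named index levels instead of positional ones
df = df.sort_index(level=['Antibody', 'Time', 'Repeats'])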
