Create new fake data with new primary keys from existing dataframe python

I have a dataframe as following:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c'], 'name': ['Anna', 'Peter', 'John'], 'year': [1999, 2001, 1993]})
I want to create new data by randomly rearranging the values in each column. For the id column, I also need to append a random letter to each value. The new rows should then be added to the existing df1, like this:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c', '2by', '1ao', '1az', '3cc'], 'name': ['Anna', 'Peter', 'John', 'John', 'Peter', 'Anna', 'Anna'], 'year': [1999, 2001, 1993, 1999, 1999, 2001, 2001]})
Could anyone help me, please? Thank you very much.

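Use DataFrame.sample and add a random letter with numpy.random.choice:
import string
import numpy as np
import pandas as pd
N = 5
# sample N rows with replacement, then append one random letter to each id
df2 = (df1.sample(n=N, replace=True)
          .assign(id=lambda x: x['id'] + np.random.choice(list(string.ascii_letters), size=N)))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = pd.concat([df1, df2], ignore_index=True)
print(df1)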
    id   name  year
0   1a   Anna  1999
1   2b  Peter  2001
2   3c   John  1993
3  1aY   Anna  1999
4  3cp   John  1993
5  3cE   John  1993
6  2bz  Peter  2001
7  3cu   John  1993
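
Note that sampling whole rows keeps each name/year pair together, and a random suffix can in principle produce the same id twice. Below is a minimal sketch, assuming the ids must stay unique and that each column should be shuffled independently as in the question's expected output. The helper name make_fake_rows and its seed parameter are hypothetical, not part of pandas.

```python
import string

import numpy as np
import pandas as pd


def make_fake_rows(df, n, seed=None):
    """Hypothetical helper: append n fake rows with unique suffixed ids."""
    rng = np.random.default_rng(seed)
    letters = list(string.ascii_letters)
    # Draw every column independently so values are re-arranged per column.
    new = pd.DataFrame({
        col: rng.choice(df[col].to_numpy(), size=n, replace=True)
        for col in df.columns
    })
    used = set(df["id"])
    ids = []
    for base in new["id"]:
        candidate = base + str(rng.choice(letters))
        while candidate in used:  # redraw the suffix on a collision
            candidate = base + str(rng.choice(letters))
        used.add(candidate)
        ids.append(candidate)
    new["id"] = ids
    return pd.concat([df, new], ignore_index=True)


df1 = pd.DataFrame({'id': ['1a', '2b', '3c'],
                    'name': ['Anna', 'Peter', 'John'],
                    'year': [1999, 2001, 1993]})
out = make_fake_rows(df1, n=5, seed=0)
print(out)
```

The redraw loop assumes n is small compared to the 52 possible suffixes per base id; for larger n a longer suffix would be needed.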

Related

Drop redundant rows for group in pandas

I have the following DataFrame:
import pandas as pd
data = {'id': ['A','A','A','A','A','A',
'A','A','A','A','A','A',
'B','B','B','B','B','B',
'C', 'C', 'C', 'C', 'C', 'C',
'D', 'D', 'D', 'D', 'D', 'D'],
'city':['London', 'London','London', 'London', 'London', 'London',
'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
'Milan', 'Milan','Milan', 'Milan','Milan', 'Milan',
'Paris', 'Paris', 'Paris', 'Paris', 'Paris', 'Paris',
'Berlin', 'Berlin','Berlin', 'Berlin','Berlin', 'Berlin'],
'year': [2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005],
't': [0,0,0,0,1,0,
0,0,0,0,0,1,
0,0,0,0,0,0,
0,0,1,0,0,0,
0,0,0,0,1,0]}
df = pd.DataFrame(data)
For each id - city group, I need to drop the rows for the years after the one where t=1. For instance, id = A is in London in year=2004 (t=1), so I want to drop the row for the group A - London with year=2005. Please note that if t is never 1 for a group over 2000-2005, I want to keep all of its rows (see, for instance, id = B in Milan).
The desired output:
import pandas as pd
data = {'id': ['A','A','A','A','A',
'A','A','A','A','A','A',
'B','B','B','B','B','B',
'C', 'C', 'C',
'D', 'D', 'D', 'D', 'D'],
'city':['London', 'London','London', 'London', 'London',
'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
'Milan', 'Milan','Milan', 'Milan','Milan', 'Milan',
'Paris', 'Paris', 'Paris',
'Berlin', 'Berlin','Berlin', 'Berlin','Berlin'],
'year': [2000,2001, 2002, 2003, 2004,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002,
2000,2001, 2002, 2003, 2004],
't': [0,0,0,0,1,
0,0,0,0,0,1,
0,0,0,0,0,0,
0,0,1,
0,0,0,0,1]}
df = pd.DataFrame(data)

The idea is to use a cumulative sum per group, but the values need to be shifted first so that the row with t=1 itself is kept. Boolean indexing then removes all rows after the first 1:
#if not sorted years per groups
#df = df.sort_values(['id','city','year'])
df = df[~df.groupby(['id', 'city'])['t'].transform(lambda x: x.shift().cumsum()).gt(0)]
print (df)
    id      city  year  t
 0   A    London  2000  0
 1   A    London  2001  0
 2   A    London  2002  0
 3   A    London  2003  0
 4   A    London  2004  1
 6   A  New York  2000  0
 7   A  New York  2001  0
 8   A  New York  2002  0
 9   A  New York  2003  0
10   A  New York  2004  0
11   A  New York  2005  1
12   B     Milan  2000  0
13   B     Milan  2001  0
14   B     Milan  2002  0
15   B     Milan  2003  0
16   B     Milan  2004  0
17   B     Milan  2005  0
18   C     Paris  2000  0
19   C     Paris  2001  0
20   C     Paris  2002  1
24   D    Berlin  2000  0
25   D    Berlin  2001  0
26   D    Berlin  2002  0
27   D    Berlin  2003  0
28   D    Berlin  2004  1
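
To see why the shift matters, here is a small sketch on a single group. The Series t is made-up illustration data, not taken from the question.

```python
import pandas as pd

# One hypothetical group, already sorted by year.
t = pd.Series([0, 0, 1, 0, 0, 1])

# A plain cumulative sum is already positive on the row where t == 1,
# so filtering on it would drop that row as well.
running = t.cumsum()

# Shifting first moves every value one row down, so the count only
# becomes positive on the rows after the first 1.
after_first = t.shift().cumsum().gt(0)

kept = t[~after_first]
print(kept.tolist())  # [0, 0, 1]
```

The first shifted value is NaN, and NaN > 0 is False, so the first row of each group is always kept.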

How to extract row with mixed value

I have to extract rows from a pandas dataframe with values in 'Date of birth' column which occur in a list with dates.
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
new_df = df.loc[df['Date of birth'].isin(dates)]
print(new_df)
0 Jack 1973
1 Mary 1999
2 David 1995
3 Bruce 1992/1991
4 Nick 2000
5 Mark 1969
6 Carl 1994
7 Sofie 1989/1990
I end up with the table below. As you can see, Bruce's and Sofie's rows are missing because the matching value is followed by / and another value. How can I split these values up so that those rows are selected as well?
   Name  Date of birth
0  Jack           1973
5  Mark           1969

You could use str.contains:
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
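new_df = df.loc[df['Date of birth'].str.contains(rf"\b(?:{'|'.join(dates)})\b")]
print(new_df)
Output
    Name  Date of birth
0   Jack           1973
3  Bruce      1992/1991
5   Mark           1969
7  Sofie      1989/1990
The string rf"\b(?:{'|'.join(dates)})\b" is a regex pattern that matches any string containing one of the dates as a whole word. The non-capturing group (?:...) is needed because | has the lowest precedence in a regex: without it, the word boundaries \b would only apply to the first and last alternatives.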
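
Because | binds more loosely than \b, it helps to wrap the alternatives in a group so that the word boundaries apply to every date. A minimal sketch, assuming the dates may contain regex metacharacters in general (hence re.escape); the sample Series s is made up for illustration.

```python
import re

import pandas as pd

dates = ['1973', '1992', '1969', '1989']

# Non-capturing group so that \b applies to each alternative.
pattern = r"\b(?:" + "|".join(re.escape(d) for d in dates) + r")\b"

s = pd.Series(['1973', '1992/1991', '19890', '2000'])
mask = s.str.contains(pattern)
print(mask.tolist())  # [True, True, False, False]
```

The value '19890' is not matched because there is no word boundary between '1989' and the trailing '0'.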

I like @DaniMesejo's way better, but here is a way that splits up the values and stacks them:
# Series.max(level=0) was removed in pandas 2.0; group on the index level instead
df[df['Date of birth'].str.split('/', expand=True).stack().isin(dates).groupby(level=0).max()]
Output:
    Name  Date of birth
0   Jack           1973
3  Bruce      1992/1991
5   Mark           1969
7  Sofie      1989/1990
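
The same split-and-check idea can also be written with Series.explode, which keeps the original row label on every piece. This is only a sketch of an alternative spelling, not the answerer's code.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
                   'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']

# One row per individual year; the index still points at the original row.
parts = df['Date of birth'].str.split('/').explode()

# A row is kept if any of its years is in the list.
mask = parts.isin(dates).groupby(level=0).any()
new_df = df[mask]
print(new_df)
```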

How to target specific rows for replacement with order is different between dataframes?

Say I have 2 dataframes and want to use one of them to fillna the other, but the rows are not in exactly the same order. Is it possible to match on specific columns as part of the mapping?
Here's an example:
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
'country':['uk', 'usa', 'fr','fr','uk'],
'rep': ['john', 'john', 'claire', 'kyle','kyle']
})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
'country':['usa', 'uk', 'fr','uk','fr'],
'sales': [21,10, 20, 12,10],
'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']
})
print(data)
print(dataNew)
print(dataNew.fillna(data))
My output is not right because the rows of dataNew are not in the same order as those of data (uk/usa are swapped, and so are fr/uk at the end). Is there a way to match rows on year, country and sales before replacing the NaN values in the rep column?
The output I'm looking for is shown below, in the row order of the first dataframe. I'm trying to understand how I could fill in the NaNs by looking up the matching cells in another dataframe. To keep the question simple, I gave the first dataframe fewer fields, so the question is purely about mapping/searching.
   year  country  sales     rep
0  2016       uk     10    john
1  2016      usa     21    john
2  2015       fr     20  claire
3  2014       fr     10    kyle
4  2013       uk     12    kyle

import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
'country':['uk', 'usa', 'fr','fr','uk'],
'sales': [10, 21, 20, 10,12],
'rep': ['john', 'john', 'claire', 'kyle','kyle']
})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
'country':['usa', 'uk', 'fr','uk','fr'],
'sales': [21,10, 20, 12,10],
'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']
})
# join on these three columns and get the rep column from the other dataframe
cols = ['year', 'country', 'sales']
# drop(columns=...) replaces the positional axis argument removed in pandas 2.0
dataNew = dataNew.drop(columns='rep').join(data.set_index(cols), on=cols)
print(dataNew)
output:
   year  country  sales     rep
0  2016      usa     21    john
1  2016       uk     10    john
2  2015       fr     20  claire
3  2013       uk     12    kyle
4  2014       fr     10    kyle

You can also use Multi-Index along with fillna():
import pandas as pd
import numpy as np
data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
'country':['uk', 'usa', 'fr','fr','uk'],
'rep': ['john', 'john', 'claire', 'kyle','kyle']
})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
'country':['usa', 'uk', 'fr','uk','fr'],
'sales': [21,10, 20, 12,10],
'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']
})
(dataNew.set_index(['year', 'country']).sort_index().fillna(
data.set_index(['year', 'country']).sort_index())
).reset_index()
   year  country  sales     rep
0  2013       uk     12    kyle
1  2014       fr     10    kyle
2  2015       fr     20  claire
3  2016       uk     10    john
4  2016      usa     21    john
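
If the row order of dataNew should be preserved and only the missing rep values filled, a left merge followed by fillna is another option. A minimal sketch, assuming each (year, country) pair occurs only once in data; the column name rep_lookup is a hypothetical temporary name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})

keys = ['year', 'country']
# Bring the lookup value in under a temporary name (hypothetical).
lookup = data[keys + ['rep']].rename(columns={'rep': 'rep_lookup'})
merged = dataNew.merge(lookup, on=keys, how='left')

# Only fill where rep is missing, then drop the helper column.
merged['rep'] = merged['rep'].fillna(merged['rep_lookup'])
filled = merged.drop(columns='rep_lookup')
print(filled)
```

Existing rep values in dataNew are left untouched; only the NaN cells take the value from data.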

Create dictionary with two keys where both keys must be met to retrieve value

I have two dataframes:
df_one:
person  city  year  col_x
ah      bos   1998
bc      bos   1996
dm      ny    2001
hh      la    1999
df_two:
person  city  range_a  range_b
mk      bos   1995     2004
kk      bos   2004     2017
ab      ny    1977     2019
fc      dc    2001     2005
cc      dc    2006     2019
et      la    1995     2005
tr      mia   1997     2006
I'd like to fill col_x in df_one with values based on conditions involving both df_one and df_two. Take the city from df_one and match it against the city in df_two; if the year in df_one falls within the range given in df_two, place the person from df_two into col_x of df_one.
Example: "ah" from the first row in df_one - the city is bos, and the year 1998 - so col_x would be mk for that row, because the city matches and 1998 falls between 1995 and 2004.
I'm not really sure where to start with this on pandas; I believe it may be some kind of nested dictionary with two values, but not sure if that's possible.

Here is a way to go about it.
First I created the dataframes based on your description:
import pandas as pd
df1 = pd.DataFrame({'A': ['ah','bc','dm','hh'], 'B':['bos','bos','ny','la'], 'C': [1998,1996,2001,1999]})
df2 = pd.DataFrame({'A': ['mk','kk','ab','fc','cc','et','tr'], 'B':['bos','bos','ny','dc','dc','la','la'],'C': [1995,2004,1977,2001,2006,1995,1997], 'D':[2004,2017,2019,2005,2019,2005,2006] })
df1
    A    B     C
0  ah  bos  1998
1  bc  bos  1996
2  dm   ny  2001
3  hh   la  1999
df2
    A    B     C     D
0  mk  bos  1995  2004
1  kk  bos  2004  2017
2  ab   ny  1977  2019
3  fc   dc  2001  2005
4  cc   dc  2006  2019
5  et   la  1995  2005
6  tr   la  1997  2006
Then I pass each row of df1 to a function (check_data). The function compares the row with all the rows in df2 and returns all the matching values from df2['A'] based on the condition you described. Please read my comment on your original question: 'la' in df1 matches 2 rows in df2.
Option 1: keep all the matching values, so each cell of df1['D'] holds an array.
Option 2: keep only the first of the matching values, so each cell of df1['D'] holds a single value.
You can choose whichever option you want to go for, or clarify further.
Option 1:
def check_data(row):
    return df2[(df2['B'] == row['B']) & (df2['C'] <= row['C']) & (df2['D'] >= row['C'])]['A'].values
df1['D'] = df1.apply(check_data, axis=1)
df1
    A    B     C         D
0  ah  bos  1998      [mk]
1  bc  bos  1996      [mk]
2  dm   ny  2001      [ab]
3  hh   la  1999  [et, tr]
Option 2:
def check_data(row):
    return df2[(df2['B'] == row['B']) & (df2['C'] <= row['C']) & (df2['D'] >= row['C'])]['A'].iloc[0]
df1['D'] = df1.apply(check_data, axis=1)
df1
    A    B     C   D
0  ah  bos  1998  mk
1  bc  bos  1996  mk
2  dm   ny  2001  ab
3  hh   la  1999  et
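
One caveat with Option 2: .iloc[0] raises an IndexError when a row in df1 has no match in df2. A minimal sketch of a guarded variant follows; the function name first_match and the extra 'zz' row are hypothetical additions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['ah', 'bc', 'dm', 'hh', 'zz'],
                    'B': ['bos', 'bos', 'ny', 'la', 'mia'],
                    'C': [1998, 1996, 2001, 1999, 1990]})
df2 = pd.DataFrame({'A': ['mk', 'kk', 'ab', 'fc', 'cc', 'et', 'tr'],
                    'B': ['bos', 'bos', 'ny', 'dc', 'dc', 'la', 'la'],
                    'C': [1995, 2004, 1977, 2001, 2006, 1995, 1997],
                    'D': [2004, 2017, 2019, 2005, 2019, 2005, 2006]})


def first_match(row):
    """Return the first df2['A'] whose city and year range match, else None."""
    hits = df2.loc[(df2['B'] == row['B'])
                   & (df2['C'] <= row['C'])
                   & (df2['D'] >= row['C']), 'A']
    return hits.iloc[0] if not hits.empty else None


df1['D'] = df1.apply(first_match, axis=1)
print(df1)
```

The last row ('zz' in 'mia') has no matching city in df2, so its D value is left missing instead of raising an error.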

You can use pandas DataFrame.merge, then use a lambda expression to compute the column according to your required logic.
Example:
import pandas as pd
First dataframe:
df_one = pd.DataFrame([
{'A': 'ah', 'B': 'bos', 'C': 1998, 'D': ''},
{'A': 'bc', 'B': 'bos', 'C': 1996, 'D': ''},
{'A': 'dm', 'B': 'ny', 'C': 2001, 'D': ''},
{'A': 'hh', 'B': 'la', 'C': 1999, 'D': ''},
])
print("df_one")
print(df_one)
Second dataframe:
df_two = pd.DataFrame([
{'A': 'mk', 'B': 'bos', 'C': 1995, 'D': 2004},
{'A': 'kk', 'B': 'bos', 'C': 2004, 'D': 2017},
{'A': 'ab', 'B': 'ny', 'C': 1977, 'D': 2019},
{'A': 'fc', 'B': 'dc', 'C': 2001, 'D': 2005},
{'A': 'cc', 'B': 'dc', 'C': 2006, 'D': 2019},
{'A': 'et', 'B': 'la', 'C': 1995, 'D': 2005},
{'A': 'tr', 'B': 'la', 'C': 1997, 'D': 2006},
])
print("df_two")
print(df_two)
Merge the dataframes on column B:
df_merged = pd.merge(df_one, df_two, on='B')
print("df_merged")
print(df_merged)
Perform your logic:
df_merged['D_x'] = df_merged.apply(lambda x: x['A_y'] if x['C_y'] <= x['C_x'] <= x['D_y'] else '', axis=1)
print(df_merged)
Get required columns only:
result_columns = ['A_x', 'B', 'C_x', 'D_x']
df_result = df_merged[result_columns]
Rename the columns to desired format:
df_result = df_result.rename(columns={'A_x': 'A', 'C_x': 'C', 'D_x': 'D'})
Clean up records with blank value of D:
df_result = df_result[df_result['D'] != '']
print("df_result")
print(df_result)
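
The merge-then-filter steps above can also be condensed with Series.between. This is a sketch using the column names from the question rather than A/B/C/D; the '_two' suffix is an arbitrary choice.

```python
import pandas as pd

df_one = pd.DataFrame({'person': ['ah', 'bc', 'dm', 'hh'],
                       'city': ['bos', 'bos', 'ny', 'la'],
                       'year': [1998, 1996, 2001, 1999]})
df_two = pd.DataFrame({'person': ['mk', 'kk', 'ab', 'fc', 'cc', 'et', 'tr'],
                       'city': ['bos', 'bos', 'ny', 'dc', 'dc', 'la', 'mia'],
                       'range_a': [1995, 2004, 1977, 2001, 2006, 1995, 1997],
                       'range_b': [2004, 2017, 2019, 2005, 2019, 2005, 2006]})

# Pair every df_one row with every df_two row of the same city.
pairs = df_one.merge(df_two, on='city', suffixes=('', '_two'))

# Keep only the pairs where the year lies inside the range (inclusive).
in_range = pairs['year'].between(pairs['range_a'], pairs['range_b'])
result = (pairs.loc[in_range, ['person', 'city', 'year', 'person_two']]
               .rename(columns={'person_two': 'col_x'})
               .reset_index(drop=True))
print(result)
```

As with the answer above, a df_one row with several matching ranges appears once per match, and a row without any match is dropped.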

List of dict of dict in Pandas

I have list of dict of dicts in the following form:
[{0:{'city':'newyork', 'name':'John', 'age':'30'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}},]
I want to create pandas DataFrame in the following form:
city     name  age
newyork  John  30
newyork  John  30
I have tried a lot, but without any success. Can you help me?

Use a list comprehension with concat and DataFrame.from_dict:
import pandas as pd
L = [{0:{'city':'newyork', 'name':'John', 'age':'30'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}}]
df = pd.concat([pd.DataFrame.from_dict(x, orient='index') for x in L])
print (df)
   name  age     city
0  John   30  newyork
0  John   30  newyork
Solution with multiple keys with new column id should be:
L = [{0:{'city':'newyork', 'name':'John', 'age':'30'},
1:{'city':'newyork1', 'name':'John1', 'age':'40'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}}]
L1 = [dict(v, id=k) for x in L for k, v in x.items()]
print (L1)
[{'name': 'John', 'age': '30', 'city': 'newyork', 'id': 0},
{'name': 'John1', 'age': '40', 'city': 'newyork1', 'id': 1},
{'name': 'John', 'age': '30', 'city': 'newyork', 'id': 0}]
df = pd.DataFrame(L1)
print (df)
   age      city  id   name
0   30   newyork   0   John
1   40  newyork1   1  John1
2   30   newyork   0   John

from pandas import DataFrame
ldata = [{0: {'city': 'newyork', 'name': 'John', 'age': '30'}},
{0: {'city': 'newyork', 'name': 'John', 'age': '30'}}, ]
# Create a DataFrame from the ldata above
df = DataFrame(d[0] for d in ldata)
print(df)
"""
The answer is:
   age     city  name
0   30  newyork  John
1   30  newyork  John
"""

import pandas as pd
d = [{0:{'city':'newyork', 'name':'John', 'age':'30'}},{0:{'city':'newyork', 'name':'John', 'age':'30'}},]
df = pd.DataFrame([list(i.values())[0] for i in d])
print(df)
Output:
   age     city  name
0   30  newyork  John
1   30  newyork  John

You can use:
In [41]: df = pd.DataFrame(next(iter(e.values())) for e in l)
In [42]: df
Out[42]:
   age     city  name
0   30  newyork  John
1   30  newyork  John

I came up with another solution. It is not as straightforward as the ones posted here, but it works properly:
L = [{0:{'city':'newyork', 'name':'John', 'age':'30'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}}]
df = [L[i][0] for i in range(len(L))]
df = pd.DataFrame.from_records(df)
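
For reference, here is a sketch that flattens every inner dict regardless of how many keys each outer dict has. The printed outputs in the answers above show alphabetically sorted columns, which appears to reflect older Python/pandas versions; on Python 3.7+ with a recent pandas, the columns should follow the dicts' insertion order.

```python
import pandas as pd

L = [{0: {'city': 'newyork', 'name': 'John', 'age': '30'},
      1: {'city': 'newyork1', 'name': 'John1', 'age': '40'}},
     {0: {'city': 'newyork', 'name': 'John', 'age': '30'}}]

# Collect every inner dict, ignoring the outer integer keys.
rows = [inner for outer in L for inner in outer.values()]
df = pd.DataFrame(rows)
print(df)
```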
