efficiently looping through pandas dataframe columns to make new dataframe - python

I want to select 200 titles at random for each question (the correct title should appear only once in the window) and create a new dataframe. Here I am using a list and a for loop to do that, which is taking a great deal of time since I have around 80k questions. All 80k questions are unique, while around 8k titles are unique.
I have the following code:
import random
import pandas as pd

questions = new_df['question_string'].tolist()
titles = new_df['titles'].tolist()
indexs = new_df['image_index'].tolist()

full_list = []
for x in range(len(questions)):
    # The correct pairing, labelled 1.
    full_list.append([questions[x], titles[x], indexs[x], 1])
    # Rebuild the unique-title pool and remove the correct title from it.
    t = new_df.titles.unique().tolist()
    if t.count(titles[x]) > 0:
        t.remove(titles[x])
    # 199 random incorrect pairings, labelled 0.
    for y in random.choices(t, k=199):
        full_list.append([questions[x], y, indexs[x], 0])

len(full_list)  # sanity check: should be 200 * number of questions
full_list_df = pd.DataFrame(full_list)
full_list_df.columns = ['questions', 'titles', 'image_index', 'is_similar']
I need help doing this more efficiently, maybe by using the dataframe directly.
This is what my dataframe looks like:
question_string titles image_index is_similar
0 In how many countries, is the net taxes in con... Net taxes on products in different countries i... 33715 1
1 In how many countries, is the gross enrolment ... Total enrollments of female students in school... 68226 1
2 In how many years, is the percentage of popula... Percentage of the population living below the ... 152731 1
3 What is the ratio of the enrollment rate in pr... Net enrolment rate in primary and secondary ed... 27823 1
4 In how many countries, is the contraceptive pr... Percentage of women of different countries who... 72232 1
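A vectorized approach avoids the per-question Python loop entirely. Below is a minimal sketch, assuming new_df has the columns shown above; it draws all 199 negative titles per question in one numpy call (with replacement, like random.choices in the loop version) and bumps any accidental draw of the correct title to a neighbouring one:

import numpy as np
import pandas as pd

uniq = new_df['titles'].unique()          # ~8k unique titles
n_q = len(new_df)                         # ~80k questions

# Draw 199 title indices per question in a single call.
rand_idx = np.random.randint(0, len(uniq), size=(n_q, 199))

# Find draws that accidentally hit the correct title and shift them
# by one position (mod len(uniq)) so they land on a different title.
title_to_idx = {t: i for i, t in enumerate(uniq)}
correct_idx = new_df['titles'].map(title_to_idx).to_numpy()
clash = rand_idx == correct_idx[:, None]
rand_idx[clash] = (rand_idx[clash] + 1) % len(uniq)

# Negative rows: each question repeated 199 times with wrong titles.
neg = pd.DataFrame({
    'questions': np.repeat(new_df['question_string'].to_numpy(), 199),
    'titles': uniq[rand_idx.ravel()],
    'image_index': np.repeat(new_df['image_index'].to_numpy(), 199),
    'is_similar': 0,
})

# Positive rows: the original correct pairings.
pos = new_df[['question_string', 'titles', 'image_index']].rename(
    columns={'question_string': 'questions'}).assign(is_similar=1)

full_list_df = pd.concat([pos, neg], ignore_index=True)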

Related

How to determine gender counts in data grouped with groupby in Python?

In a market dataset, each row belongs to a product group. I want to group this data by receipt number (ficheno) and find the total number of male and female customers.
The dataset is as shown in the picture. The number of unique ficheno values is 141783. Therefore, the total number of customers should be 141783.
If you just want to count the total number of each gender, you may try this:
import pandas as pd

df = pd.read_csv('./data.csv')
# Keep one row per receipt so each customer is counted exactly once.
df = df.groupby('ficheno').first()
male_count = df[df.gender == 'm'].gender.count()
female_count = df[df.gender == 'f'].gender.count()
This will result in male = 2, female = 3 according to your sample dataset.
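As a minor variant (not from the original answer), the same counts can be read off in one step with value_counts:

# Deduplicate by receipt, then count how many customers of each gender remain.
gender_counts = pd.read_csv('./data.csv').groupby('ficheno').first().gender.value_counts()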

How to iterate through a dataframe based on two conditions?

I have a sample of companies with financial figures which I would like to compare. My data looks like this:
Cusip9 Issuer IPO Year Total Assets Long-Term Debt Sales SIC-Code
1 783755101 Ryerson Tull Inc 1996 9322000.0 2632000.0 633000.0 3661
2 826170102 Siebel Sys Inc 1996 995010.0 0.0 50250.0 2456
3 894363100 Travis Boats & Motors Inc 1996 313500.0 43340.0 23830.0 3661
4 159186105 Channell Commercial Corp 1996 426580.0 3380.0 111100.0 7483
5 742580103 Printware Inc 1996 145750.0 0.0 23830.0 8473
For every company I want to calculate a "similarity score". This score should indicate the comparability with other companies, so I want to compare them on different financial figures. The comparability should be expressed as the Euclidean distance to the "closest company": the square root of the sum of the squared differences between the financial figures. So I need to calculate the distance to every company that fits these conditions, but only need the closest score: assets of company 1 minus assets of company 2, plus debt of company 1 minus debt of company 2, ...
√((x_1 - y_1)^2 + (x_2 - y_2)^2)
This should only be computed for companies with the same SIC code, and the IPO year of the comparable companies should be smaller than that of the company for which the similarity score is computed: I only want to compare these companies with already-listed companies.
Hopefully my point is clear. Does someone have any idea where I can start? I am just starting with programming and am completely lost with this.
Thanks in advance.
I would first create different dataframes according to the SIC code, so every new dataframe only contains companies with the same SIC code. Then, for each of those dataframes, just double-loop over the companies and compute the scores, and store them in a matrix. (So you'll end up with a symmetric matrix of scores; see the sketch below.)
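A rough sketch of that per-SIC-code score matrix, assuming the column names from the sample above and using just Total Assets and Long-Term Debt as the financial figures:

import numpy as np
import pandas as pd

features = ['Total Assets', 'Long-Term Debt']
scores = {}
for sic, group in df.groupby('SIC-Code'):
    vals = group[features].to_numpy(dtype=float)
    # Pairwise differences between every pair of companies in this SIC group.
    diffs = vals[:, None, :] - vals[None, :, :]
    # Euclidean distance matrix (symmetric, zeros on the diagonal).
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    scores[sic] = pd.DataFrame(dist, index=group['Issuer'], columns=group['Issuer'])

The IPO-year condition can then be applied when reading the closest score off each company's row.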
Try this. Here I compare against companies with an IPO year equal to or smaller than the current company's, since you didn't give any company record with a strictly smaller IPO year; you can change it to strictly smaller (<) in the statement Group = df[...].
def closestCompany(companyRecord):
    # Peers: same SIC code, already listed (IPO year <= current), and not the company itself.
    Group = df[(df['SIC-Code'] == companyRecord['SIC-Code'])
               & (df['IPO Year'] <= companyRecord['IPO Year'])
               & (df['Issuer'] != companyRecord['Issuer'])]
    # Euclidean distance on the financial figures; keep only the closest peer's score.
    return (((Group['Total Assets'] - companyRecord['Total Assets'])**2
             + (Group['Long-Term Debt'] - companyRecord['Long-Term Debt'])**2)**0.5).min()

df['Closest Company Similarity Score'] = df.apply(closestCompany, axis=1)
df

How do I merge or attach a characteristic when the key I'm merging on isn't unique?

I have two different CSVs with different information that I need. The first has an account number, a ticker (mutual funds), and a dollar amount. The second has a list of tickers and their classification (stock, bond, etc.). I want to merge the two on the ticker so that I have the account number, ticker, classification, and dollar amount all together. Several of the account numbers hold the same funds, meaning a ticker will be used multiple times. When I try merging, I get duplicated rows and a lot of missing information.
I tried merging with inner and left joins. I tried making the second CSV a dictionary to reference. I attempted a for loop with a lambda, but I'm pretty new to this, so that didn't go well. I also tried to groupby account number and ticker before merging, but that didn't work either. The columns I'm trying to merge on have the same datatype; I tried with non-float object and string.
pd.merge(df1, df2, on='Ticker', how='inner')
expected: (each account number may have 5 unique tickers)
A B C D
1 a bond 500
1 b stock 100
1 c bond 250
2 a bond 300
2 b stock 400
what I get:
A B C D
1 a bond 500
1 a bond 500
1 a bond 500
2 a bond 300
2 a bond 300
It seems to overwrite all the unique rows for the account number with the first row.
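For what it's worth, this symptom is exactly what a merge produces when the classification table itself contains repeated Ticker rows: every duplicate key in the right table multiplies the matching left-table rows. A minimal sketch with illustrative stand-in data (not the asker's actual CSVs):

import pandas as pd

# Illustrative stand-ins for the two CSVs described above.
df1 = pd.DataFrame({'Account': [1, 1, 2], 'Ticker': ['a', 'b', 'a'],
                    'Dollars': [500, 100, 300]})
df2 = pd.DataFrame({'Ticker': ['a', 'a', 'b'],  # note the duplicate 'a'
                    'Class': ['bond', 'bond', 'stock']})

# Each duplicate key in df2 multiplies the matching df1 rows,
# which produces the repeated output shown above.
merged_bad = pd.merge(df1, df2, on='Ticker', how='inner')

# Deduplicating the lookup table first keeps one row per ticker.
merged = pd.merge(df1, df2.drop_duplicates('Ticker'), on='Ticker', how='inner')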

Python random.shuffle does not give exact unique values to the data frame

I am making a dummy dataset with a list of companies as user_id, the jobs posted by each company as job_id, and c_id as the candidate ID.
I have already achieved the first two steps, and my dataset looks like this:
user_id job_id
0 HP HP2
1 Microsoft Microsoft4
2 Accenture Accenture2
3 HP HP0
4 Dell Dell4
5 FIS FIS1
6 HP HP0
7 Microsoft Microsoft4
8 Dell Dell2
9 Accenture Accenture0
The rows are also shuffled. Now I wish to add a random candidate ID to this dataset in such a way that no c_id is repeated for a particular job_id.
My approach for this is as follows.
joblist is a list of all job_ids.
for i in range(50):
    l = list(range(0, len(df[df['job_id'] == joblist[i]])))
    random.shuffle(l)
    df['c_id'][df['job_id'] == joblist[i]] = l
After which I tested it as:
len(df['c_id'][df['job_id'] == joblist[0]])  # output: 168
df['c_id'][df['job_id'] == joblist[0]].nunique()  # output: 101
The same is happening with all values. I have rechecked the uniqueness of l after each step, and it has 168 unique values.
What am I doing wrong here?
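One observation worth checking (not from the original thread): df['c_id'][mask] = l is chained indexing, which can assign into a temporary copy rather than into df itself. A sketch of the same loop using .loc, which assigns in place:

import random

for i in range(50):
    mask = df['job_id'] == joblist[i]
    # Build a 0..n-1 candidate ID list for this job and shuffle it.
    l = list(range(mask.sum()))
    random.shuffle(l)
    # .loc writes back into df directly (no chained-indexing copy).
    df.loc[mask, 'c_id'] = l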
Unique IDs are provided by basic pd functions, so you don't need anything fancy. Solutions vary in efficiency based on how big your df is.
# Hashing, for small datasets:
df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1

# Grouping, for larger datasets:
df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1

# Assign (some_combo_of_columns is a placeholder for whichever columns define uniqueness):
df = df.assign(id=some_combo_of_columns.astype('category').cat.codes)
Further reading:
Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
How to assign a unique ID to detect repeated rows in a pandas dataframe?

Identify unique values within pandas dataframe rows that share a common id number

Here is a sample df:
data = {"Brand":{"0":"BrandA","1":"BrandA","2":"BrandB","3":"BrandB","4":"BrandC","5":"BrandC"},"Cost":{"0":18.5,"1":19.5,"2":6,"3":6,"4":17.69,"5":18.19},"IN STOCK":{"0":10,"1":15,"2":5,"3":1,"4":12,"5":12},"Inventory Number":{"0":1,"1":1,"2":2,"3":2,"4":3,"5":3},"Labels":{"0":"Black","1":"Black","2":"White","3":"White","4":"Blue","5":"Blue"},"Maximum Price":{"0":30.0,"1":35.0,"2":50,"3":45.12,"4":76.78,"5":76.78},"Minimum Price":{"0":23.96,"1":25.96,"2":12.12,"3":17.54,"4":33.12,"5":28.29},"Product Name":{"0":"Product A","1":"Product A","2":"ProductB","3":"ProductB","4":"ProductC","5":"ProductC"}}
df = pd.DataFrame(data=data)
My actual data set is much larger, but maintains the same pattern of there being 2 rows that share the same Inventory Number throughout.
My goal is to create a new data frame that contains only the inventory numbers where a cell value is not duplicated across both rows, and for those inventory numbers, only contains the data from the row with the lower index that is different from the other row.
For this example the resulting data frame would need to look like:
data = {"Inventory Number":{"0":1,"1":2,"2":3},"Cost":{"0":18.50,"1":"","2":17.69},"IN STOCK":{"0":10,"1":5,"2":""},"Maximum Price":{"0":30,"1":50,"2":""},"Minimum Price":{"0":23.96,"1":12.12,"2":33.12}}
df = pd.DataFrame(data=data)
The next time this would run, perhaps nothing changed in the "Maximum Price", so that column would need to not be included at all.
I was hoping someone would have a clean solution using groupby, but if not, I imagine the solution would include dropping all duplicates, then looping through all of the remaining inventory numbers and evaluating each column for duplicates.
icol = 'Inventory Number'
# Drop rows that are completely identical across both rows of an inventory number.
d0 = df.drop_duplicates(keep=False)
# Number the two rows within each inventory number (0 and 1).
i = d0.groupby(icol).cumcount()
# Pivot so each field's two values sit side by side in columns 0 and 1.
d1 = d0.set_index([icol, i]).unstack(icol).T
# Keep only the cells where the second row differs from the first,
# then reshape back to one row per inventory number.
d1[1][d1[1] != d1[0]].unstack(0)
Cost IN STOCK Maximum Price Minimum Price
Inventory Number
1 19.5 15 35 25.96
2 None 1 45.12 17.54
3 18.19 None None 28.29
Try this:
In [68]: cols = ['Cost','IN STOCK','Inventory Number','Maximum Price','Minimum Price']
In [69]: df[cols].drop_duplicates(subset=['Inventory Number'])
Out[69]:
Cost IN STOCK Inventory Number Maximum Price Minimum Price
0 18.5 10 100566 30.0 23.96
