Python : get random data from dataframe pandas - python

I have a df with these values:
name algo accuracy
tom 1 88
tommy 2 87
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
How can I randomly pick 4 records from df, with the condition that at least one record is picked for each unique value in the algo column? Here, the algo column has only 3 unique values (1, 2, 3).
Sample output 1:
name algo accuracy
tom 1 88
tommy 2 87
stuart 3 100
lincoln 1 88
Sample output 2:
name algo accuracy
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88

One way:
num_sample, num_algo = 4, 3
# sample one row for each algo
out = df.groupby('algo').sample(n=num_sample//num_algo)
# add the remaining samples from the rows that didn't get selected
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
out = pd.concat([out, df.drop(out.index).sample(n=num_sample-len(out))])
Another way is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration and take the required number of samples. This is slightly more code than the first approach, but is cheaper and produces more balanced algo counts:
# shuffle data
df_random = df['algo'].sample(frac=1)
# enumerate the rows that share the same algo (0, 1, 2, ... per algo)
enums = df_random.groupby(df_random).cumcount()
# sort by that enumeration, so one row per algo comes first
enums = enums.sort_values()
# pick the first num_sample indices
# these will be indices of the samples
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
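For completeness, here is a self-contained sketch of the shuffle-and-enumerate approach run against the sample data from the question. The helper name sample_with_coverage and its seed argument are made up for this illustration, and it assumes n is at least the number of unique algo values:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def sample_with_coverage(frame, key, n, seed=None):
    """Pick n random rows with at least one row per unique value of `key`.

    Hypothetical helper; assumes n >= frame[key].nunique().
    """
    # Shuffle all rows, then number the rows inside each group 0, 1, 2, ...
    shuffled = frame.sample(frac=1, random_state=seed)
    rank = shuffled.groupby(key).cumcount()
    # Rows ranked 0 (one per group) sort first, then rows ranked 1, and so on
    chosen = rank.sort_values(kind="stable").index[:n]
    return frame.loc[chosen]


out = sample_with_coverage(df, "algo", 4, seed=0)
print(out)
```

Because the rows with rank 0 sort first, every algo value contributes one row before any value contributes a second one.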

Related

How to reverse the order of a specific column in python

I have extracted some data online and I would like to reverse the first column order.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/us_weekly.html").content, 'html.parser')
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Frequency':e.td.text,
        'Artists':e.a.text,
        'Songs':e.a.find_next_sibling('a').text
    })
data2 = data[:100]
print(data2)
data = pd.DataFrame(data2).to_excel('Kworb_Weekly.xlsx', index = False)
And here is my output:
(Screenshot of the output: https://i.stack.imgur.com/TmGmI.png)
I've used [::-1], but it reversed all the columns and I only want to reverse the first column.
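Your first column is 'Frequency', so take that column from the data frame and assign its values back in reverse order. Assign the underlying array rather than the reversed Series, because pandas would align a reversed Series back to the original index and leave the column unchanged:
data = pd.DataFrame(data2)
print(data)
# to_numpy() drops the index, so the values are written back by position
data['Frequency'] = data['Frequency'].to_numpy()[::-1]
print(data)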
Got this as the output:
Frequency Artists Songs
0 1 SZA Kill Bill
1 2 PinkPantheress Boy's a liar Pt. 2
2 3 Miley Cyrus Flowers
3 4 Morgan Wallen Last Night
4 5 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 96 Lizzo Special
96 97 Glass Animals Heat Waves
97 98 Frank Ocean Pink + White
98 99 Foo Fighters Everlong
99 100 Meghan Trainor Made You Look
[100 rows x 3 columns]
Frequency Artists Songs
0 100 SZA Kill Bill
1 99 PinkPantheress Boy's a liar Pt. 2
2 98 Miley Cyrus Flowers
3 97 Morgan Wallen Last Night
4 96 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 5 Lizzo Special
96 4 Glass Animals Heat Waves
97 3 Frank Ocean Pink + White
98 2 Foo Fighters Everlong
99 1 Meghan Trainor Made You Look
[100 rows x 3 columns]
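To see why index alignment matters when reversing a single column, here is a small sketch on a three-row frame. The frame contents and the variable names aligned and flipped are invented for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Frequency": [1, 2, 3],
    "Artists": ["SZA", "PinkPantheress", "Miley Cyrus"],
})

# Assigning a reversed Series: pandas matches values by index label,
# so every value lands back in its original row.
aligned = df.copy()
aligned["Frequency"] = aligned["Frequency"][::-1]

# Assigning a reversed NumPy array: there is no index to align on,
# so the values are written by position and the column is reversed.
flipped = df.copy()
flipped["Frequency"] = flipped["Frequency"].to_numpy()[::-1]

print(aligned["Frequency"].tolist())
print(flipped["Frequency"].tolist())
```

Only the Frequency column changes in flipped; the Artists column keeps its original order.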

how to compare two csv file in python and flag the difference?

I am new to Python. Kindly help me.
I have two CSV files. I need to compare them and output the differences: changed data, deleted data and added data. Here is my example.
file 1:
Sn Name Subject Marks
1 Ram Maths 85
2 sita Engilsh 66
3 vishnu science 50
4 balaji social 60
file 2:
Sn Name Subject Marks
1 Ram computer 85 #subject name have changed
2 sita Engilsh 66
3 vishnu science 90 #marks have changed
4 balaji social 60
5 kishor chem 99 #added new line
Output - i need to get like this :
Changed Items:
1 Ram computer 85
3 vishnu science 90
Added item:
5 kishor chem 99
Deleted item:
.................
I imported csv and did the comparison with a for loop and readlines, but I am not getting the desired output. Flagging the added and deleted items between file 1 and file 2 confuses me a lot. Please suggest effective code.
The idea here is to flatten your dataframes with melt so that each value can be compared:
import numpy as np
import pandas as pd
# Load your csv files
df1 = pd.read_csv('file1.csv', ...)
df2 = pd.read_csv('file2.csv', ...)
# Select columns (not mandatory, it depends on your 'Sn' column)
cols = ['Name', 'Subject', 'Marks']
# Flat your dataframes
out1 = df1[cols].melt('Name', var_name='Item', value_name='Old')
out2 = df2[cols].melt('Name', var_name='Item', value_name='New')
out = pd.merge(out1, out2, on=['Name', 'Item'], how='outer')
# Flag the state of each item
# (test for missing values first: NaN != value is always True, so the
# inequality test would otherwise label added/deleted items as 'changed')
condlist = [out['Old'].isna(),
            out['New'].isna(),
            out['Old'] != out['New']]
out['State'] = np.select(condlist, choicelist=['added', 'deleted', 'changed'],
                         default='unchanged')
Output:
>>> out
     Name     Item      Old       New      State
0     Ram  Subject    Maths  computer    changed
1    sita  Subject  Engilsh   Engilsh  unchanged
2  vishnu  Subject  science   science  unchanged
3  balaji  Subject   social    social  unchanged
4     Ram    Marks       85        85  unchanged
5    sita    Marks       66        66  unchanged
6  vishnu    Marks       50        90    changed
7  balaji    Marks       60        60  unchanged
8  kishor  Subject      NaN      chem      added
9  kishor    Marks      NaN        99      added
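Alternatively, if the rows of both files line up by position (new rows are only appended at the end) and each file has 4 columns, a plain loop is enough:
count, flag = 0, 1
for i, j in zip(df1.values, df2.values):
    # a row is unchanged only if all 4 values match
    if sum(i == j) != 4:
        if flag:
            print("Changed Items:")
            flag = 0
        print(j)
    count += 1
# rows of df2 beyond the length of df1 are new
if count != len(df2):
    print("Newly added:")
    print(*df2.iloc[count:, :].values)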
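If the Sn column can be treated as a unique key (an assumption, the question does not say so), the three groups asked for can also be computed directly from the indexes. This is only a sketch on the sample data from the question; the variable names are invented, and it assumes there are no missing values in the compared columns:

```python
import pandas as pd

old = pd.DataFrame({
    "Sn": [1, 2, 3, 4],
    "Name": ["Ram", "sita", "vishnu", "balaji"],
    "Subject": ["Maths", "Engilsh", "science", "social"],
    "Marks": [85, 66, 50, 60],
}).set_index("Sn")

new = pd.DataFrame({
    "Sn": [1, 2, 3, 4, 5],
    "Name": ["Ram", "sita", "vishnu", "balaji", "kishor"],
    "Subject": ["computer", "Engilsh", "science", "social", "chem"],
    "Marks": [85, 66, 90, 60, 99],
}).set_index("Sn")

# Keys present in only one file are additions or deletions
added = new.loc[new.index.difference(old.index)]
deleted = old.loc[old.index.difference(new.index)]

# For keys present in both files, compare the rows value by value
common = old.index.intersection(new.index)
changed_mask = (old.loc[common] != new.loc[common]).any(axis=1)
changed = new.loc[common][changed_mask.to_numpy()]

print("Changed items:", changed, sep="\n")
print("Added items:", added, sep="\n")
print("Deleted items:", deleted, sep="\n")
```

Keying on Sn means a row inserted in the middle of the second file does not shift every later comparison, which is the weak spot of a purely positional loop.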

Set multiple columns to zero based on a value in another column [duplicate]

I have a sample dataset here. In the real case there is a train and a test dataset, each with around 300 columns and 800 rows. I want to select all rows that have a certain value in one column and then set the values in those rows, from column 3 to column 50 for example, to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav','Princi','Anuj','Nancy'],
        'Age':[27, 24, 22, 32,66,43],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment':[15,20,40,50,3,23],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd','MA','MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, some values in the first column are "Princi". For every row where the Name column equals "Princi", I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # this selects all those columns
and
train.loc[train['column_wanted'] == "that value"]  # this selects all those rows
But how can I combine them? Thanks for your help!
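Use the loc accessor with a boolean row selection and a column slice: df.loc[boolean selection, columns]. Label slices are inclusive, so 'Address':'Payment' covers both columns:
df.loc[df['Name'].eq('Princi'), 'Address':'Payment'] = 0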
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
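When the target columns are easier to describe by position ("column 3 to column 50") than by label, the same pattern works with a slice of df.columns. A minimal sketch, using a made-up train frame with numeric feature columns f1 to f3:

```python
import pandas as pd

# Hypothetical stand-in for the real train set
train = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Princi"],
    "Age": [27, 24, 22, 32],
    "f1": [1, 2, 3, 4],
    "f2": [5, 6, 7, 8],
    "f3": [9, 10, 11, 12],
})

# Rows to modify: a boolean mask built from one column
mask = train["Name"].eq("Princi")

# Columns to modify: labels taken by position (here positions 2 and 3)
cols = train.columns[2:4]

# Combine both selections in a single .loc assignment
train.loc[mask, cols] = 0
print(train)
```

Unlike a label slice, the positional slice columns[2:4] excludes its end position, so f3 is left untouched here.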

80 Gb file - Creating a data frame that submits data based upon a list of counties

I am working with an 80 Gb data set in Python. The data has 30 columns and ~180,000,000 rows.
I am using the chunk size parameter in pd.read_csv to read the data in chunks where I then iterate through the data to create a dictionary of the counties with their associated frequency.
This is where I am stuck. Once I have the list of counties, I want to iterate through the chunks row-by-row again summing the values of 2 - 3 other columns associated with each county and place it into a new DataFrame. This would roughly be 4 cols and 3000 rows which is more manageable for my computer.
I really don't know how to do this, this is my first time working with a large data set in python.
import pandas as pd
from collections import defaultdict
df_chunk = pd.read_csv('file.tsv', sep='\t', chunksize=8000000)
county_dict = defaultdict(int)
for chunk in df_chunk:
    for county in chunk['COUNTY']:
        county_dict[county] += 1
for chunk in df_chunk:
    for row in chunk:
        # I don't know where to go from here
I expect to be able to make a DataFrame with a column of all the counties, a column for total sales of product "1" per county, another column for sales of product per county, and then more columns of the same as needed.
The idea
I was not sure whether your data concerns counties (e.g. in the UK or USA) or countries (in the world), so I used countries in my example.
The idea is to:
1. Group the data from each chunk by country.
2. Generate a partial result for this chunk, as a DataFrame with:
   - sums of each column of interest (per country),
   - the number of rows per country.
3. Give each partial result the chunk number as an additional index level, so that the partial results can be concatenated (in a moment).
4. Concatenate the partial results vertically (thanks to the additional index level, each row has a different index).
5. Compute the final result (total sums and row counts) as the sum of the above result, grouped by country (discarding the chunk number).
Test data
The source CSV file contains country names and 2 columns to sum (Tab separated):
Country Amount_1 Amount_2
Austria 41 46
Belgium 30 50
Austria 45 44
Denmark 31 42
Finland 42 32
Austria 10 12
France 74 54
Germany 81 65
France 40 20
Italy 54 42
France 51 16
Norway 14 33
Italy 12 33
France 21 30
For the test purpose I assumed chunk size of just 5 rows:
chunksize = 5
Solution
The main processing loop (and preparatory steps) are as follows:
df_chunk = pd.read_csv('Input.csv', sep='\t', chunksize=chunksize)
chunkPartRes = []  # Partial results from each chunk
chunkNo = 0
for chunk in df_chunk:
    chunkNo += 1
    gr = chunk.groupby('Country')
    # Sum the desired columns and get the size of each group
    # ('sum' as a string avoids the warning newer pandas raises for the builtin)
    res = gr.agg(Amount_1=('Amount_1', 'sum'), Amount_2=('Amount_2', 'sum'))\
        .join(gr.size().rename('Count'))
    # Add top index level (chunk No), then append
    chunkPartRes.append(pd.concat([res], keys=[chunkNo], names=['ChunkNo']))
To concatenate the above partial results into a single DataFrame,
but still with separate results from each chunk, run:
chunkRes = pd.concat(chunkPartRes)
For my test data, the result is:
Amount_1 Amount_2 Count
ChunkNo Country
1 Austria 86 90 2
Belgium 30 50 1
Denmark 31 42 1
Finland 42 32 1
2 Austria 10 12 1
France 114 74 2
Germany 81 65 1
Italy 54 42 1
3 France 72 46 2
Italy 12 33 1
Norway 14 33 1
And to generate the final result, summing data from all chunks,
but keeping separation by countries, run:
res = chunkRes.groupby(level=1).sum()
The result is:
Amount_1 Amount_2 Count
Country
Austria 96 102 3
Belgium 30 50 1
Denmark 31 42 1
Finland 42 32 1
France 186 120 4
Germany 81 65 1
Italy 66 75 2
Norway 14 33 1
To sum up
Even if we look only at how the number of rows per country is computed, this solution is more "pandasonic" and elegant than using a defaultdict and incrementing it in a loop over each row.
Grouping and counting rows per group also works significantly quicker than a loop operating on individual rows.
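If the per-chunk breakdown is not needed, the chunk-number index level can be left out. The following sketch runs the same group-then-sum idea end to end on the test data above; it reads from an in-memory string via io.StringIO instead of a file, and the variable names are invented for the illustration:

```python
import io

import pandas as pd

rows = [
    ("Austria", 41, 46), ("Belgium", 30, 50), ("Austria", 45, 44),
    ("Denmark", 31, 42), ("Finland", 42, 32), ("Austria", 10, 12),
    ("France", 74, 54), ("Germany", 81, 65), ("France", 40, 20),
    ("Italy", 54, 42), ("France", 51, 16), ("Norway", 14, 33),
    ("Italy", 12, 33), ("France", 21, 30),
]
# Build a tab-separated text that stands in for the large input file
tsv = "Country\tAmount_1\tAmount_2\n" + "\n".join(
    f"{country}\t{a1}\t{a2}" for country, a1, a2 in rows
)

partials = []
for chunk in pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=5):
    grouped = chunk.groupby("Country")
    # Per-chunk sums of the columns of interest
    part = grouped[["Amount_1", "Amount_2"]].sum()
    # Per-chunk row count, aligned on the Country index
    part["Count"] = grouped.size()
    partials.append(part)

# Countries repeat across chunks, so group once more and add them up
total = pd.concat(partials).groupby(level=0).sum()
print(total)
```

Only the small per-chunk summaries are kept in memory, so the approach should scale to a file that does not fit in RAM as long as the number of distinct keys stays modest.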

Split one table into multiple table based on one column [duplicate]

Given a table (/dataFrame) x:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Is it possible to split the table into two tables based on the name column (that acts as an index), and nest the two tables under the same object (not sure about the exact terms to use). So in the example above, tables[0] will be:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
and tables[1] will be:
name day earnings revenue
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Note that the number of rows in each 'sub-table' may vary.
Cheers,
Create a dictionary of DataFrames, keyed by the values of the name column:
dfs = dict(tuple(df.groupby('name')))
And then select each sub-table by its key:
print (dfs['Oliver'])
print (dfs['John'])
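The question asks for positional access (tables[0], tables[1]). As a sketch on the sample table, a dictionary in first-appearance order can be turned into a list; the names by_name and tables are just for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Oliver", "Oliver", "John", "John", "John", "John"],
    "day": [1, 2, 1, 2, 3, 4],
    "earnings": [100, 200, 144, 415, 33, 82],
    "revenue": [44, 69, 11, 54, 10, 82],
})

# sort=False keeps the groups in the order the names first appear
by_name = {name: group for name, group in df.groupby("name", sort=False)}

# A plain list gives positional access to the same sub-tables
tables = list(by_name.values())

print(tables[0])
print(tables[1])
```

Without sort=False the groups come back sorted by key, so John would be placed before Oliver.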
