How to measure data completeness in pandas [closed] - python

I'm using pandas and I have a dataset containing 20 columns and 65 rows. What I'm trying to do is measure the data completeness, i.e. the percentage of NaN values relative to the whole dataset. For example, the output I need is: The percentage of NaNs in the dataset is: 40%
I've counted the number of NaNs with comp_df.isna().sum().sum() and got a result of 776, but I don't know what to do next.

Use:
import numpy as np
import pandas as pd

comp_df = pd.DataFrame(dict(a=[np.nan, 1, 1],
                            b=[np.nan, np.nan, np.nan]))
print (comp_df)
     a   b
0  NaN NaN
1  1.0 NaN
2  1.0 NaN
In your solution it is possible to divide by DataFrame.size, the number of all values:
print (comp_df.isna().sum().sum() / comp_df.size * 100)
66.66666666666666
Or reshape the values to a Series with DataFrame.stack and use mean, which is sum/count by definition:
print (comp_df.isna().stack().mean() * 100)
66.66666666666666
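Applied to the counts from the question (776 NaNs in a 20-column by 65-row frame), either formula gives about 59.7% rather than the 40% used as a sample output:
nan_count = 776        # from comp_df.isna().sum().sum()
total_cells = 20 * 65  # comp_df.size for 20 columns and 65 rows
print(f'The percentage of NaNs in the dataset is: {nan_count / total_cells * 100:.1f}%')
# The percentage of NaNs in the dataset is: 59.7%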

Related

best way of inserting my stock array data into database [closed]

My array (stock data) is given as:
structure of my_data = [date, (price:volume), ..., (price:volume)]
For example
my_data =
["2022-12-01", (2000:157),(2005:5), (2010:23654), (2050:132)]
["2022-12-02", (1990:4),(2000:123)]
["2022-12-03", (2010:11),(2005:12100),(2050,342)]
["2022-12-04", (2080:1230),(2090:55),(3010,34212354),(3050,29873)]
As you see, the number of (price:volume) pairs per date is arbitrary.
What is the best way of creating a database for data shaped like this?
You can create a table this way, with one row per (price, volume) pair:

date        price  volume
2022-12-01  2000   157
2022-12-01  2005   5
2022-12-02  1990   4
This allows you to query by date, so you can easily filter a single day's records or records within a range of dates.
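A minimal sketch of that schema using Python's built-in sqlite3 module; the file name, table name, and flattening step are illustrative assumptions, not from the answer:
import sqlite3

conn = sqlite3.connect('stocks.db')  # hypothetical database file
conn.execute('''
    CREATE TABLE IF NOT EXISTS trades (
        date   TEXT,
        price  INTEGER,
        volume INTEGER
    )
''')

# Flatten each [date, (price, volume), ...] record into one row per pair.
my_data = [
    ['2022-12-01', (2000, 157), (2005, 5), (2010, 23654), (2050, 132)],
    ['2022-12-02', (1990, 4), (2000, 123)],
]
rows = [(rec[0], price, volume) for rec in my_data for price, volume in rec[1:]]
conn.executemany('INSERT INTO trades VALUES (?, ?, ?)', rows)
conn.commit()

# Query one day or a range of dates.
for row in conn.execute('SELECT * FROM trades WHERE date BETWEEN ? AND ?',
                        ('2022-12-01', '2022-12-02')):
    print(row)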

How to merge 3 different pandas dataframes? [closed]

I have 3 pandas dataframes, each a single column:

df1    df2        df3
abc    colour     length
xyz    type       breadth
       pattern    height
                  area
I want to combine the dataframes so that the result looks like this:

abc    colour     length
       type       breadth
       pattern    height
                  area
xyz    colour     length
       type       breadth
       pattern    height
                  area
I also want to export the end result to an Excel sheet, so how do I do that without making it look messy?
First concat the second and third dataframes with each row of the first dataframe, then concat the pieces together:
import pandas as pd

df1 = pd.DataFrame([['abc'], ['xyz']], columns=['col1'])
df2 = pd.DataFrame([['colour'],
                    ['type'],
                    ['pattern']], columns=['col2'])
df3 = pd.DataFrame([['length'],
                    ['breadth'],
                    ['height'],
                    ['area']], columns=['col3'])

result = pd.concat(
    [pd.concat([df1[lambda x: x.index == i].reset_index(), df2, df3], axis=1)
     for i in range(len(df1))]
).drop('index', axis=1)
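For the Excel part, pandas' DataFrame.to_excel handles the export directly (writing .xlsx files requires the openpyxl package). Assuming the concat result above is saved as result, with a file name chosen here for illustration:
# Write without the RangeIndex so the sheet stays tidy.
result.to_excel('merged.xlsx', index=False)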

How do I get the sum of column from a csv within specified rows using dates in python? [closed]

Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
This is a fraction of the data I've been working on. The actual data is in the same format, with around 1000 entries per date. I take start_date and end_date as inputs from the user. In this case, consider:
start_date:2018-01-02
end_date:2018-01-06
So, I have to display totals for hrs and Count within the selected date range. I also want to do it using an @app.callback in Dash (Plotly). Can someone help please?
Use Series.between to build a boolean mask, filter rows with DataFrame.loc while selecting the hrs and Count columns, and then sum:
df = df.loc[df['Date'].between('2018-01-02','2018-01-06'), ['hrs','Count']].sum()
print (df)
hrs 22
Count 78
dtype: int64
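For the Dash part, a minimal sketch of the same filter wired into an @app.callback, assuming a DatePickerRange input and a Div output; the component ids, file name, and layout are illustrative assumptions, not from the question:
import pandas as pd
from dash import Dash, dcc, html
from dash.dependencies import Input, Output

df = pd.read_csv('data.csv')  # hypothetical file holding the rows above

app = Dash(__name__)
app.layout = html.Div([
    dcc.DatePickerRange(id='date-range'),
    html.Div(id='totals'),
])

@app.callback(
    Output('totals', 'children'),
    Input('date-range', 'start_date'),
    Input('date-range', 'end_date'),
)
def show_totals(start_date, end_date):
    if not start_date or not end_date:
        return 'Pick a date range'
    totals = df.loc[df['Date'].between(start_date, end_date), ['hrs', 'Count']].sum()
    return f"hrs: {totals['hrs']}, Count: {totals['Count']}"

if __name__ == '__main__':
    app.run_server(debug=True)  # app.run(debug=True) on newer Dash versions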

Python method to display dataframe rows with least common column string [closed]

I have a dataframe with 3 columns (department, sales, region), and I want to write a method to display all rows from the least common region. Then I need another method to count the frequency of the departments represented in the least common region. No idea how to do this.
Functions are unnecessary: pandas already has implementations to accomplish what you want! Suppose I had the following csv file, test.csv...
department,sales,region
sales,26,midwest
finance,45,midwest
tech,69,west
finance,43,east
hr,20,east
sales,34,east
If I'm understanding you correctly, I would obtain a DataFrame representing the least common region like so:
import pandas as pd
df = pd.read_csv('test.csv')
counts = df['region'].value_counts()
least_common = counts[counts == counts.min()].index[0]
least_common_df = df.loc[df['region'] == least_common]
least_common_df is now:
  department  sales region
2       tech     69   west
As for obtaining the department frequency for the least common region, I'll leave that up to you. (I've already shown you how to get the frequency for region.)
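If a hint helps, the same value_counts pattern carries over; the line below is a sketch rather than part of the original answer:
# Frequency of each department within the least common region
print(least_common_df['department'].value_counts())
# tech    1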

Calculating mean of each row, ignoring 0 values in python [closed]

I have a data frame with 1000 rows and 10 columns.
3 of these columns are 'total_2013', 'total_2014' and 'total_2015'
I would like to create a new column, containing the average of total over these 3 years for each row, but ignoring any 0 values.
If you are using pandas, use DataFrame.mean with its skipna parameter.
First replace 0 with NaN so the zeros are ignored, then compute the row-wise mean, assigning the result to the new column:
import numpy as np

columns = ['total_2013', 'total_2014', 'total_2015']
df['total'] = (
    df[columns]
    .replace(0, np.nan)  # zeros become NaN so the mean ignores them
    .mean(axis=1,        # row-wise mean across the three columns
          skipna=True)   # skip NaN values (the default)
)
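A small self-contained check of the behaviour, using made-up numbers:
import numpy as np
import pandas as pd

df = pd.DataFrame({'total_2013': [10, 0],
                   'total_2014': [20, 0],
                   'total_2015': [0, 30]})
columns = ['total_2013', 'total_2014', 'total_2015']
df['total'] = df[columns].replace(0, np.nan).mean(axis=1, skipna=True)
print(df['total'])
# 0    15.0  (mean of 10 and 20; the 0 is ignored)
# 1    30.0  (only the single non-zero value counts)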
