How can I create a single row containing the data type, maximum column length and count for each column of a data frame, as shown in the desired output section at the bottom?
import pandas as pd
table = 'sample_data'
idx=0
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,'NULL',40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65]),
'new_column':pd.Series([])
}
#Create a DataFrame using above data
sdf = pd.DataFrame(d)
#Create a summary description
desired_data = sdf.describe(include='all').T
desired_data = desired_data.rename(columns={'index':'Variable'})
#print(summary)
#Get Data Type
dtype = sdf.dtypes
#print(data_type)
#Get total count of records (need to work on)
counts = sdf.shape[0] # gives number of row count
#Get maximum length of values
maxcollen = []
for col in range(len(sdf.columns)):
    maxcollen.append(max(sdf.iloc[:,col].astype(str).apply(len)))
#print('Max Column Lengths ', maxColumnLenghts)
#Constructing final data frame
desired_data = desired_data.assign(data_type = dtype.values)
desired_data = desired_data.assign(total_count = counts)
desired_data = desired_data.assign(max_col_length = maxcollen)
final_df = desired_data
final_df = final_df.reindex(columns=['data_type','max_col_length','total_count'])
final_df.insert(loc=idx, column='table_name', value=table)
final_df.to_csv('desired_data.csv')
#print(final_df)
Output of above code:
The desired output I am looking for is :
In : sdf
Out:
table_name Name_data_type Name_total_count Name_max_col_length Age_data_type Age_total_count Age_max_col_length Rating_data_type Rating_total_count Rating_max_col_length
sample_data object 12 6 object 12 4 float64 12 4
As you may have noticed, I want to print a single row with the columns column_name_data_type, column_name_total_count and column_name_max_col_length for each original column, along with their respective values.
Here's a solution:
df = final_df
df = df.drop("new_column").drop("table_name", axis=1)
df = df.reset_index()
df.melt(id_vars=["index"]).set_index(["index", "variable"]).sort_index().transpose()
The result is:
index Age Name \
variable data_type max_col_length total_count data_type max_col_length ...
value object 4 12 object 6 ...
Can you try this:
The code below iterates over the entire dataframe, so it may be slow on large inputs. It is not the optimal solution, but it is a working solution for the above problem.
from collections import OrderedDict
## storing key-value pair
result_dic = OrderedDict()
unique_table_name = final_df["table_name"].unique()
# remove unwanted rows
final_df.drop("new_column", inplace=True)
cols_name = final_df.columns
## for every unique table name, generating row
# loop variable renamed so it no longer shadows the array it iterates over
for table_name in unique_table_name:
    result_dic["table_name"] = table_name
    filtered_df = final_df[final_df["table_name"] == table_name]
    for row in filtered_df.iterrows():
        for cols in cols_name:
            if cols != "table_name":
                result_dic[row[0]+"_"+cols] = row[1][cols]
Convert dict to dataframe
## convert dataframe from dict
result_df = pd.DataFrame([result_dic])
result_df
expected output is:
table_name Name_data_type Name_max_col_length Name_total_count Age_data_type Age_max_col_length Age_total_count Rating_data_type Rating_max_col_length Rating_total_count
0 sample_data object 6 12 object 4 12 float64 4 12
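For what it's worth, the single row can also be built straight from the original frame, without going through describe() and a reshape. The following is only a sketch under assumptions: the small frame stands in for the question's sdf, and summarize_columns is a hypothetical helper name, not something from the answers above.

```python
import pandas as pd

# Hypothetical small frame standing in for the question's `sdf`.
sdf = pd.DataFrame({
    "Name": ["Tom", "Ricky", "Vin"],
    "Age": [25, "NULL", 40],
    "Rating": [4.23, 3.24, 3.98],
})

def summarize_columns(frame, table_name):
    """Build one row holding dtype, row count and max string length per column."""
    row = {"table_name": table_name}
    for col in frame.columns:
        row[f"{col}_data_type"] = str(frame[col].dtype)
        row[f"{col}_total_count"] = len(frame)
        # Length of the longest value once every entry is rendered as text
        row[f"{col}_max_col_length"] = int(frame[col].astype(str).str.len().max())
    return pd.DataFrame([row])

summary = summarize_columns(sdf, "sample_data")
print(summary.T)  # transposed only to make the wide row easier to read
```

Since Python dicts preserve insertion order, the columns come out grouped per source column in the order the keys are added.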
Related
I have a file with 136 columns. I am trying to find the unique values of each column and, from there, the number of rows for each unique value.
I tried using a DataFrame and a dict for the unique values. However, when I export the result back to a CSV file, the unique values are exported as a list in one cell for each column.
Is there any way to simplify counting the unique values in each column?
df = pd.read_excel(filename)
column_headers = list(df.columns.values)
df_unique = {}
df_count = {}
def approach_1(data):
    # Count the entries that are not NaN. The original test
    # (not entry == 'nan' or not entry == 'NaN') was always True.
    count = 0
    for entry in data:
        if str(entry).lower() != 'nan':
            count += 1
    return count
for unique in column_headers:
    new = df.drop_duplicates(subset=unique , keep='first')
    df_unique[unique] = new[unique].tolist()
csv_unique = pd.DataFrame(df_unique.items(), columns = ['Data Source Field', 'First Row'])
csv_unique.to_csv('Unique.csv', index = False)
for count in df_unique:
    not_nan = approach_1(df_unique[count])
    df_count[count] = not_nan
csv_count = pd.DataFrame(df_count.items(), columns = ['Data Source Field', 'Count'])
.unique() is simpler: len(df[col].unique()) is the count of unique values.
import pandas as pd
# named `records` rather than `dict` to avoid shadowing the built-in type
records = [
    {"col1":"0","col2":"a"},
    {"col1":"1","col2":"a"},
    {"col1":"2","col2":"a"},
    {"col1":"3","col2":"a"},
    {"col1":"4","col2":"a"},
    {"col2":"a"}
]
df = pd.DataFrame(records)
result_dict = {}
for col in df.columns:
    # dropna() keeps the missing col1 entry from counting as a value
    result_dict[col] = len(df[col].dropna().unique())
print(result_dict)
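pandas also has built-ins for both counts, which avoids the explicit loop. A small sketch, assuming the same kind of toy data as above; the output column names "Data Source Field" and "Unique Count" are just illustrative labels.

```python
import pandas as pd

records = [
    {"col1": "0", "col2": "a"},
    {"col1": "1", "col2": "a"},
    {"col2": "a"},
]
df = pd.DataFrame(records)

# Number of distinct non-null values in every column at once
unique_counts = df.nunique()

# Number of rows for each distinct value of a single column
col2_counts = df["col2"].value_counts()

# One row per column, so nothing ends up as a list inside a single CSV cell
table = unique_counts.rename_axis("Data Source Field").reset_index(name="Unique Count")
print(table.to_csv(index=False))
```

Passing a file path to to_csv writes the same table to disk.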
Sort the data frame by the values of the 5th column ["Card"] and add a new row after each card number with its count. After sorting the values, how can I add a new row with Total:
Dataframe looks something like this
This is how I want output data frame
You can give this a try:
import pandas as pd
# create dummy df
card = ["2222","2222","1111","2222","1111","3333"]
name = ["Ed", "Ed", "John", "Ed", "John", "Kevin"]
phone = ["1##-###-####", "1##-###-####", "2##-###-####", "1##-###-####", "2##-###-####", "3##-###-####"]
df = pd.DataFrame({"Name":name, "Phone":phone, "Card":card})
# sort by Card value
df = df.sort_values(by=["Card"]).reset_index(drop=True)
# Groupby the Card value, count them, then insert a new row based on that count
index = 0
line = []
for x in df.groupby("Card").size():
    index += x
    line.append(pd.DataFrame({"Name": "", "Phone":"", "Card": str(x)}, index=[index]))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, *line], ignore_index=False)
df = df.sort_values(by=["Card"]).sort_index().reset_index(drop=True)
df
Output:
Name Phone Card
0 Ed 1##-###-#### 1111
1 Ed 1##-###-#### 1111
2 Ed 1##-###-#### 1111
3 3
4 John 2##-###-#### 2222
5 John 2##-###-#### 2222
6 2
7 Kevin 3##-###-#### 3333
8 1
Edit ~~~~
Due to OP's use of string for card numbers, an edit had to be made to account for naturally sorting string ints
import random  # needed for random.randrange below
import pandas as pd
from natsort import natsort_keygen ##### Now needed because OP has Card numbers as strings
# create dummy df ##############
card = ["1111", "2222", "3333", "4444", "5555", "6666", "7777", "8888"]
name = ["Ed", "John", "Jake", "Mike", "Liz", "Anne", "Deb", "Steph"]
phone = ["1###", "2###", "3###", "4###", "5###", "6###", "7###", "8###"]
dfList = [a for a in zip(name, phone, card)]
dfList = [dfList[random.randrange(len(dfList))] for i in range(50)]
df = pd.DataFrame(dfList, columns=["Name", "Phone", "Card"])
################################
# sort by Card value
df = df.sort_values(by=["Card"]).reset_index(drop=True)
# Groupby the Card value, count them, then insert a new row based on that count
index = 0
line = []
for x in df.groupby("Card").size():
    index += x
    line.append(pd.DataFrame({"Name": "", "Phone":"", "Card": str(x)}, index=[index-1]))
df = pd.concat([df, pd.concat(line)], ignore_index=False)
# Create an Index column to be used in the by pandas sort_values
df["Index"] = df.index
# Sort the values first by index then by card number, use "natsort_keygen()" to naturally sort ints that are strings
df = df.sort_values(by = ['Index', 'Card'], key=natsort_keygen(), ascending = [True, False]).reset_index(drop=True).drop(["Index"], axis=1)
I'm not sure if this is the best way but it worked for me.
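list_of_df = [v for k, v in df.groupby("card")]
for i, group in enumerate(list_of_df):
    # DataFrame.append was removed in pandas 2.0, so the count row is added with pd.concat
    count_row = pd.DataFrame([{"card": str(group.shape[0])}])
    list_of_df[i] = pd.concat([group, count_row], ignore_index=True)
final_df = pd.concat(list_of_df)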
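Another way to get the same layout without depending on index positions is to collect each group followed by its own count row and concatenate once at the end. This is only a sketch with made-up data; the "Total: n" label is an assumption about how the count row should look.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ed", "Ed", "John", "Ed", "John", "Kevin"],
    "Card": ["2222", "2222", "1111", "2222", "1111", "3333"],
})

pieces = []
# groupby sorts the keys by default, so groups arrive in Card order
for card, group in df.groupby("Card", sort=True):
    pieces.append(group)
    # One extra row per card holding the size of that group
    pieces.append(pd.DataFrame({"Name": [""], "Card": [f"Total: {len(group)}"]}))

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Concatenating once at the end also avoids growing the frame row by row.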
I have a ~8million-ish row data frame consisting of sales for 615 products across 16 stores each day for five years.
I need to make new columns that consist of the sales shifted back by 1 to 7 days. I've decided to sort the data frame by date, product and location. Then I concatenate item and location as its own column.
Using that column I loop through each unique item/location concatenation and make the shifted sales columns. This code is below:
import pandas as pd
#sort values by item, location, date
df = df.sort_values(['date', 'product', 'location'])
df['sort_values'] = df['product']+"_"+df['location']
df1 = pd.DataFrame()
z = 0
for i in list(df['sort_values'].unique()):
    df_ = df[df['sort_values']==i]
    df_ = df_.sort_values('ORD_DATE')
    df_['eaches_1'] = df_['eaches'].shift(-1)
    df_['eaches_2'] = df_['eaches'].shift(-2)
    df_['eaches_3'] = df_['eaches'].shift(-3)
    df_['eaches_4'] = df_['eaches'].shift(-4)
    df_['eaches_5'] = df_['eaches'].shift(-5)
    df_['eaches_6'] = df_['eaches'].shift(-6)
    df_['eaches_7'] = df_['eaches'].shift(-7)
    df1 = pd.concat((df1, df_))
    z+=1
    if z % 100 == 0:
        print(z)
The above code gets me exactly what I want, but takes FOREVER to complete. Is there a faster way to accomplish what I want?
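One likely speed-up is to let groupby do the per-product/location shifting, since the loop above filters the full frame and re-concatenates once per combination. The sketch below uses a made-up miniature frame; the column names follow the question, and using "date" as the sort key is an assumption (the question's loop sorts by 'ORD_DATE').

```python
import pandas as pd

# Hypothetical miniature of the sales frame
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "product": ["A"] * 3 + ["B"] * 3,
    "location": ["X"] * 6,
    "eaches": [1, 2, 3, 10, 20, 30],
})

# Order rows by date within each product/location pair
df = df.sort_values(["product", "location", "date"]).reset_index(drop=True)

# shift() on a groupby never crosses group boundaries
grouped = df.groupby(["product", "location"])["eaches"]
for k in range(1, 8):
    df[f"eaches_{k}"] = grouped.shift(-k)

print(df[["product", "date", "eaches", "eaches_1", "eaches_2"]])
```

Rows near the end of each group get NaN in the shifted columns, the same as with a per-group shift in a loop.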
I have two .txt files, and for each I want to split the data frame into two parts using the value in the first column. If the value is less than "H1000", the row should go into a first dataframe; if it is greater than or equal to "H1000", it should go into a second dataframe. Each value in the first column starts with H followed by four digits. I want to ignore the H when comparing the number against 1000 in Python.
This is what I have tried, but it is not working.
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
Here is my code:
if (".txt" in str(path_txt).lower()) and path_txt.is_file():
    txt_files = [Path(path_txt)]
else:
    txt_files = list(Path(path_txt).glob("*.txt"))
for fn in txt_files:
    all_dfs = pd.read_csv(fn,sep="\t", header=None) #Reading file
    all_dfs = all_dfs.dropna(axis=1, how='all') #Drop the columns where all columns are NaN
    all_dfs = all_dfs.dropna(axis=0, how='all') #Drop the rows where all columns are NaN
    print(all_dfs)
    ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
    print(ht_data)
    df_h = all_dfs[0:ht_data] # Head Data
    df_t = all_dfs[ht_data:] # Tene Data
Can anyone help me how to achieve this task in python?
Assuming this data
import pandas as pd
data = pd.DataFrame(
    [
        ["H0002", "Version", "5"],
        ["H0003", "Date_generated", "8-Aug-11"],
        ["H0004", "Reporting_period_end_date", "19-Jun-11"],
        ["H0005", "State", "AW"],
        ["H1000", "Tene_no/Combined_rept_no", "E75/3794"],
        ["H1001", "Tenem_holder Magnetic Resources", "NL"],
    ],
    columns = ["id", "col1", "col2"]
)
We can create a mask that separates the rows under and over a preset threshold, such as 1000.
mask = data["id"].str.strip("H").astype(int) < 1000
df_h = data[mask]
df_t = data[~mask]
If you want to compare values of the format val = HXXXX, where each X is a digit character, try this:
val = 'H1003'
val_cmp = int(val[1:])  # drop the leading 'H' and compare as an integer
if val_cmp < 1000:
    pass  # First Dataframe
else:
    pass  # Second Dataframe
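Because the question reads the files with header=None, the first column has to be addressed by position rather than by name. Here is a rough sketch of the same comparison applied to such a frame; the three rows are invented, and coercing non-numeric codes to NaN is an assumption about how stray rows should be treated.

```python
import pandas as pd

# Hypothetical frame shaped like pd.read_csv(fn, sep="\t", header=None)
all_dfs = pd.DataFrame([
    ["H0002", "Version", "5"],
    ["H1000", "Tene_no/Combined_rept_no", "E75/3794"],
    ["H1001", "Tenem_holder", "NL"],
])

# Strip the leading "H" and convert to numbers; anything unparsable becomes NaN
codes = pd.to_numeric(all_dfs.iloc[:, 0].str.lstrip("H"), errors="coerce")

df_h = all_dfs[codes < 1000]   # head data
df_t = all_dfs[codes >= 1000]  # tene data
print(df_h)
print(df_t)
```

Rows whose code could not be parsed compare as False on both sides, so they land in neither frame.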
I'm looking to add multi-index column headers to an existing dataframe. This is my current dataframe.
Name = pd.Series(['John','Paul','Sarah'])
Grades = pd.Series(['A','A','B'])
HumanGender = pd.Series(['M','M','F'])
DogName = pd.Series(['Rocko','Oreo','Cosmo'])
Breed = pd.Series(['Bulldog','Poodle','Golden Retriever'])
Age = pd.Series([2,5,4])
DogGender = pd.Series(['F','F','F'])
SchoolName = pd.Series(['NYU','UCLA','UCSD'])
Location = pd.Series(['New York','Los Angeles','San Diego'])
df = (pd.DataFrame({'Name':Name,'Grades':Grades,'HumanGender':HumanGender,'DogName':DogName,'Breed':Breed,
'Age':Age,'DogGender':DogGender,'SchoolName':SchoolName,'Location':Location}))
I want to add 3 column labels on top of the existing columns I already have. For example, columns [0,1,2,3] should be labeled 'People', columns [4,5,6] should be labeled 'Dogs', and columns [7,8] should be labeled 'Schools'. In the final result, there should be 3 columns on top of 9 columns.
Thanks!
IIUC, you can do:
newlevel = ['People']*4 + ['Dogs']*3 + ['Schools']*2
df.columns = pd.MultiIndex.from_tuples([*zip(newlevel, df.columns)])
Note that [*zip(newlevel, df.columns)] is equivalent to
[(a,b) for a,b in zip(newlevel, df.columns)]
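As a variation, pd.MultiIndex.from_arrays takes the two levels directly, so no zipping is needed. The sketch below uses a cut-down four-column frame rather than the question's nine columns, and it also shows how the new top level can then be used for selection.

```python
import pandas as pd

# Hypothetical cut-down version of the question's frame
df = pd.DataFrame({
    "Name": ["John", "Paul"],
    "Grades": ["A", "B"],
    "DogName": ["Rocko", "Oreo"],
    "SchoolName": ["NYU", "UCLA"],
})

# One top-level label per existing column, in column order
top_level = ["People"] * 2 + ["Dogs"] + ["Schools"]
df.columns = pd.MultiIndex.from_arrays([top_level, df.columns])

# Selecting a top-level label returns all columns underneath it
people = df["People"]
print(people)
```

The list of top-level labels must have exactly one entry per existing column, otherwise from_arrays raises an error.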