I have a dataframe like this:
import pandas as pd
#create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100],
                   "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]})
But this dataframe has a problem: some numbers are wrong.
Each number always has to be greater than or equal to the previous one (see 6, 4, 6, 7, 8, 7 ... 50, 75, 60, 45, 100).
I can't use df.sort because it's not about sorting, it's about correction.
Edit: I added the corrected numbers in the "Number is Corrected" column.
Guessing from your "Number is Corrected" list, you could probably use this:
import pandas as pd
#create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]})
# "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]
def correction():
    df['Number is Corrected'] = df['Number']
    cache = 0
    for num in range(len(df)):
        # use .loc to avoid chained-assignment warnings
        if df.loc[num, 'Number is Corrected'] < cache:
            df.loc[num, 'Number is Corrected'] = cache
        else:
            cache = df.loc[num, 'Number is Corrected']
    print(df)

if __name__ == "__main__":
    correction()
But there is some inconsistency, as in your conversation with jezrael. You may need to update the logic once it's clearer what output you want. Good luck.
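The loop above is just a running maximum, which pandas can express directly with `cummax()`. A minimal vectorized sketch (note that, like the loop, this produces a running maximum, which does not match every value in the asker's hand-corrected column):

```python
import pandas as pd

df = pd.DataFrame({
    "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]
})

# Each value is replaced by the largest value seen so far
df["Number is Corrected"] = df["Number"].cummax()
```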
I'm having a problem trying to count different variables for the same name. The thing is: I have a sheet with the names of all my workers and I need to count how many trainings each one had, but those trainings have different classifications: "Comercial", "Funcional" and others.
One of my columns is "Name" and the other is "Trainings". How can I filter those trainings and aggregate them per name?
import pandas as pd
import numpy as np
xls = pd.ExcelFile('BASE_Indicadores_treinamento_2021 - V3.xlsx')
df = pd.read_excel(xls, 'Base')
display(df)
df2 = df.groupby("Nome").agg({'Eixo':'count'}).reset_index()
display(df2)
What I'm getting is the TOTAL of trainings per name, but I need the count per category of training (there are 5 of them). Does anyone know what I need to do?
Thanks!
df.groupby("Nome").agg('count') should give you the total number of trainings for each person.
df.groupby(["Nome","Eixo"]).agg({'Eixo':'count'}) should give you the count per each person per each training.
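If you'd rather have one column per category instead of a long grouped result, `pd.crosstab` can pivot the counts. A sketch with made-up names and categories standing in for the real Excel data:

```python
import pandas as pd

# Toy stand-in for the "Base" sheet
df = pd.DataFrame({
    "Nome": ["Ana", "Ana", "Bruno", "Bruno", "Bruno"],
    "Eixo": ["Funcional", "Comercial", "Funcional", "Funcional", "Liderança"],
})

# One row per person, one column per training category
counts = pd.crosstab(df["Nome"], df["Eixo"])
print(counts)
```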
Problem solved!
Here's what I did:
import pandas as pd
import numpy as np
xls = pd.ExcelFile('BASE_Indicadores_treinamento_2021 - V3.xlsx')
df = pd.read_excel(xls, 'Base')
display(df)
filt_funcional = df['Eixo'] == 'Funcional'
filt_comercial = df['Eixo'] == 'Comercial'
filt_liderança = df['Eixo'] == 'Liderança'
filt_negocio = df['Eixo'] == 'Negócio'
filt_obr_cert = df['Eixo'] == 'Obrigatórios e Certificações'
df.loc[filt_funcional]['Nome'].value_counts()
Much easier than I thought!
And to give credit: I only managed it because of this video: https://www.youtube.com/watch?v=txMdrV1Ut64
I have a csv file in this format (it has thousands of rows, so I'll summarize it like this):
id,name,score1,score2,score3
1,,3.0,4.5,2.0
2,,,,
3,,4.5,3.2,4.1
I have tried to use .dropna() but that is not working.
My desired output is
id,name,score1,score2,score3
1,,3.0,4.5,2.0
3,,4.5,3.2,4.1
All I would really need is to check if score1 is empty because if score1 is empty then the rest of the scores are empty as well.
I have also tried this, but it doesn't seem to do anything.
import pandas as pd
df = pd.read_csv('dataset.csv')
df.drop(df.index[df["score1"] == ''], axis=0, inplace=True)  # note: empty CSV cells are read as NaN, not ''
df.to_csv('new.csv')
Can anyone help with this?
After seeing your edits, I realized that dropna doesn't work for you because you have a None value in every row (the empty name column), so dropna with default settings drops everything. To filter for NaN values in a specific column, I would recommend using apply, as in the following code. (Btw, StackOverflow.csv is just a file where I copied and pasted the data from your question.)
import pandas as pd
import math
df = pd.read_csv("StackOverflow.csv", index_col="id")
# Function that takes a number and returns whether it is NaN
def not_nan(number):
    return not math.isnan(number)

# Filtering the dataframe with the function
df = df[df["score1"].apply(not_nan)]
What this does is iterate through the score1 column and check whether each value is NaN. If it is, it returns False. We then use the resulting list of True and False values to filter the rows of the dataframe.
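For completeness, the same filter can be written without a helper function: `dropna(subset=...)` only looks at the columns you name, so the always-empty name column doesn't force every row out. A sketch using the data from the question:

```python
import pandas as pd
import io

csv_data = """id,name,score1,score2,score3
1,,3.0,4.5,2.0
2,,,,
3,,4.5,3.2,4.1
"""

df = pd.read_csv(io.StringIO(csv_data))

# Keep only rows where score1 is present
df = df.dropna(subset=["score1"])
print(df)
```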
import pandas as pd
df = pd.DataFrame([[1,3.0,4.5,2.0],[2],[3,4.5,3.2,4.1]], columns=["id","score1","score2","score3"])
aux1 = df.dropna()                # default: drop rows containing any NaN
aux2 = df.dropna(axis='columns')  # drop columns containing any NaN
aux3 = df.dropna(axis='index')    # same as the default (drop rows)
print('=== original ===')
print(df)
print()
print('=== mode 1 ===')
print(aux1)
print()
print('=== mode 2 ===')
print(aux2)
print()
print('=== mode 3 ===')
print(aux3)
print()
print('=== mode 4 ===')
print('drop original')
df.dropna(axis=1,inplace=True)
print(df)
I am a very beginner in programming and trying to learn to code, so please bear with my bad coding. I am using pandas to find a string in a column (the "combinations" column in the code below) and print the entire row containing that string. Basically I need to find all the instances where the string occurs and print the entire row. Find my code below; I am not able to figure out how to find that particular instance in the column and print its row.
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations = data['col1'] + data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' occurs in the data')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
My data frame looks like this.
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count=count.rename(columns={'index':'occurences'})
count.reset_index(inplace=True)
# create a df that keeps only the rows in the list 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]  # keep only the rows whose col6 is in the list
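To print the full rows themselves (the original ask), a boolean mask does it directly, and `groupby(...).transform('count')` attaches the occurrence count to each row without building a separate counts frame. A sketch on the same toy data:

```python
import pandas as pd

data = pd.DataFrame({
    "signaller": ["ciao", "ciao", "ciao"],
    "col6": ["-1-11-11", "11", "-1-11-11"],
})

# Count how often each value of col6 occurs and attach it to every row
data["occurrences"] = data.groupby("col6")["col6"].transform("count")

# Print the full rows in which a given string occurs
target = "-1-11-11"
print(data[data["col6"] == target])
```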
My result
I am using python3 and pandas to create a script to:
Read unstructured xlsx data of varying column lengths
Total the "This", "Last" and "Diff" columns
Add "Total" under the "Brand" column
Dynamically bold the entire row that contains "Total"
On the last point, the challenge I have been struggling with is that the row index changes depending on the data being fed into the script. The code provided does not have a solution to this issue. I have tried every variation I can think of using style.applymap(bold), with and without variables.
Example of input: (image)
Example of desired outcome: (image)
Script:
import pandas as pd
import io
import sys
import warnings
def bold(val):
    return 'font-weight: bold'
excel_file = 'testfile1.xlsx'
df = pd.read_excel(excel_file)
product = (df.loc[df['Brand'] == "widgit"])
product = product.append({'Brand':'Total',
'This':product['This'].sum(),
'Last':product['Last'].sum(),
'Diff':product['Diff'].sum(),
'% Chg':product['This'].sum()/product['Last'].sum()
},
ignore_index=True)
product = product.append({'Brand':' '}, ignore_index=True)
product.fillna(' ', inplace=True)
try something like this:
import numpy as np
import pandas as pd

def highlight_max(x):
    # hard-coded: bolds the values equal to the one at index 4
    return ['font-weight: bold' if v == x.loc[4] else ''
            for v in x]

df = pd.DataFrame(np.random.randn(5, 2))
df.style.apply(highlight_max)
I'm new to Python, but I need it for a personal project, so I have this lump of code. Its job is to create a table and update it as necessary. The problem is that the table keeps being overwritten and I don't know why. I'm also struggling with correctly assigning the starting position of the new lines to append; that's why total (which ends up overwritten as well) and pos are there, but I haven't figured out how to use them correctly. Any tips?
import datetime
import pandas as pd
import numpy as np
total ={}
entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[i] = [timeStamp, ID, VQ]
    entryTable.to_csv("Inventory_Table.csv")
    total[i] = 1

pos = sum(total.values())
print(pos)
inventoryTable = pd.read_csv("Inventory_Table.csv", index_col = 0)
Your variable i runs from index 0 up to the number of new entries. When you add new data to row i of your pandas dataframe, you overwrite the existing data in that row. If you want to add new data, use n + i, where n is the initial number of entries. You can determine n with either
n = len(entryTable)
or
n = entryTable.shape[0]
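Putting the answer together, a sketch of the fix (with a hypothetical starting table and hard-coded entries standing in for the input() calls):

```python
import datetime
import pandas as pd

# Hypothetical starting table with the same three columns as Entry_Table.csv
entryTable = pd.DataFrame(
    {"Timestamp": ["2021-01-01 00:00:00"], "ID": ["A1"], "VQ": [10]}
)

n = len(entryTable)  # initial number of rows, as suggested above
newEntries = [("B2", 20), ("C3", 30)]  # stand-ins for the input() calls

for i, (ID, VQ) in enumerate(newEntries):
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[n + i] = [timeStamp, ID, VQ]  # append, don't overwrite

# Write once, after the loop, instead of on every iteration
entryTable.to_csv("Inventory_Table.csv", index=False)
```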