I am building a system that sorts data from an Excel document; I have added a part of the document here: shorturl.at/DKNP7
It has the following columns: Day, Time, Sort, number, Gourmet/Fondue, sort exclusive.
I want the output sorted so that it contains the sum of the 'number' column for each of the different types.
I have some code, but I doubt it is efficient; the start of the code is included below.
import pandas as pd

df = pd.read_excel('Example_excel.xlsm', sheet_name="INVOER")
gourmet = df[['Day', 'Time', 'Sort', 'number', 'Gourmet/Fondue', 'sort exclusive']]
gourmet1 = gourmet.dropna(subset=['Sort'])  # rows where 'Sort' is not filled in are dropped
gourmet1.to_excel('test.xlsx', index=False, sheet_name='gourmet')
Maybe it needs to be split into two parts, where one part is 'exclusief' using the 'sort exclusive' column, and another part covers 'populair' and 'deluxe' from the 'Sort' column.
Looking forward to your reply!
One of the things I have tried is to split it:
gourmet_pop_del = gourmet1.groupby(['Day', 'Sort', 'Gourmet/Fondue'])['number'].sum()
gourmet_pop_del = gourmet_pop_del.reset_index()
gourmet_pop_del.sort_values(by=['Day', 'Sort', 'Gourmet/Fondue'], inplace=True)
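For what it's worth, here is a minimal sketch of that two-part split, continuing from gourmet1 above and assuming 'Sort' holds 'populair'/'deluxe' while 'sort exclusive' is only filled in for the 'exclusief' rows:

import pandas as pd

# Part 1: 'populair' and 'deluxe', with the sum of 'number' per day/sort/type
pop_del = gourmet1[gourmet1['Sort'].isin(['populair', 'deluxe'])]
pop_del_sums = (pop_del
                .groupby(['Day', 'Sort', 'Gourmet/Fondue'], as_index=False)['number']
                .sum()
                .sort_values(['Day', 'Sort', 'Gourmet/Fondue']))

# Part 2: the 'exclusief' rows, with the sum of 'number' per day and exclusive sort
exclusief = gourmet1[gourmet1['sort exclusive'].notna()]
exclusief_sums = (exclusief
                  .groupby(['Day', 'sort exclusive', 'Gourmet/Fondue'], as_index=False)['number']
                  .sum())

# Write both parts to separate sheets of one workbook
with pd.ExcelWriter('test.xlsx') as writer:
    pop_del_sums.to_excel(writer, sheet_name='populair_deluxe', index=False)
    exclusief_sums.to_excel(writer, sheet_name='exclusief', index=False)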
I am new to coding in this area and need help creating a variable number of columns. I have a data frame that is currently being updated, and I need a way to show only the columns the user picks, but in between those columns I want a column that says 'Keep'. So far I have been able to have the code select what the user wants; I am just having trouble finding an automated way to make the 'Keep' columns show up without adding them myself in between.
name_of_cols =['id','start_date', 'end_date', 'name', 'job_title', 'Keep']
All but 'Keep' are part of the data frame beforehand.
def clean_df(df, list_col):
    df2 = df.copy()
    df2 = df2.drop_duplicates(list_col)
    df3 = df2.copy()
    df3 = df3[['id', 'start_date', 'end_date', 'name', 'job_title']].reset_index(drop=True)
    cols = df3.columns.tolist()
    conditions = [df3 == name_of_cols,
                  df3 != name_of_cols]
    results = ['Keep', 'No Keep']
    df3_new['Keep'] = np.select(conditions, results)
    return df3[name_of_cols]

df3_new = clean_df(df3, name_of_cols)
This creates the list I need, but when I try to add 'Keep' I get:
KeyError: Index(['Keep'], dtype='object')
I am assuming this is because 'Keep' is not part of the original data frame.
I have code that defines all of this, so defining the data frames is not an issue.
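A minimal sketch of one way to avoid that KeyError: create the 'Keep' column(s) explicitly before selecting them. The interleaved variant below uses hypothetical 'Keep_<col>' labels (not from the original post), since duplicate column names would be ambiguous:

import pandas as pd

def clean_df_sketch(df, list_col):
    # Drop duplicates on the selected columns, as in the original
    df2 = df.drop_duplicates(list_col).reset_index(drop=True)
    # Create 'Keep' before selecting it, so the final selection cannot raise a KeyError
    df2['Keep'] = 'Keep'
    return df2[list_col + ['Keep']]

def interleave_keep(df, list_col):
    # Variant: a separate 'Keep_<col>' column after every selected column
    out = pd.DataFrame(index=df.index)
    for col in list_col:
        out[col] = df[col]
        out['Keep_' + col] = 'Keep'  # hypothetical naming; labels must be unique
    return out

selected = ['id', 'start_date', 'end_date', 'name', 'job_title']
# df3_new = clean_df_sketch(df, selected)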
From what I can tell, as far as your code goes, it might be a syntax error.
results = ['Keep', 'Don't Keep']
df3_new['keep'] = np.select(conditions, results)
return df3[name_of_cols]
It seems like you have an unintended apostrophe where you have Don't Keep. I might suggest using quotation marks to eliminate this issue, but I don't know if this is the solution you are looking for. (I don't know a whole lot about data frames)
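For instance, double quotes keep the apostrophe from ending the string:

results = ['Keep', "Don't Keep"]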
I am working with a single data frame in pandas. This error does not occur when I perform the following on a subset of this data frame (6 rows, with NaN in some), and it does exactly what I need. In this case all the NaN in the 'Season' column got filled in properly.
Before:
Code:
s = df.set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
After:
Great! This is what I want to happen, one column at a time. I will worry about the other columns later.
But when I try the same code on the entire data frame, which is over 5,000 rows long, I get the error stated in the title, and I am unable to pinpoint it to a specific row or rows.
What I tried:
I removed all non-ASCII characters, and these special characters: ', ", and #, from the strings in the 'Description' column, which sometimes contains 50 characters including non-ASCII characters and the three specific characters that I removed.
df['Description'] = df['Description'].str.encode('ascii', 'ignore').str.decode('ascii')
df['Description'] = df['Description'].str.replace('"', '')
df['Description'] = df['Description'].str.replace("'", "")
df['Description'] = df['Description'].str.replace('#', '')
But the above did not help, and I still get the error. Does anyone have additional troubleshooting tips, or know what I am failing to look for? Or ideally a solution.
The code for the subset DataFrame and the main DataFrame is isolated, so I am not mixing up 'df' and 's' or using them interchangeably. I wish that were the problem.
Recall the subset data frame above where the code worked perfectly. Through process of elimination I discovered that when the subset data frame has one extra row (a total of 8 rows), the code still works as expected. But once the 9th row is entered, I get the error. I can't figure out why.
Then the code:
s = df.set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
And the data frame is updated as expected:
But when the 9th row is added, the code above does not work:
I discovered how to solve the problem by adding .drop_duplicates('Description'), thereby modifying:
s = df.set_index('Description')['Season'].dropna()
to
s = df.drop_duplicates('Description').set_index('Description')['Season'].dropna()
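For anyone hitting the same wall, here is a minimal reproduction with made-up data: the error appears as soon as two non-NaN rows share the same 'Description', because the lookup series s then has a duplicate index and .map() refuses to reindex on it (the exact wording varies by pandas version):

import pandas as pd

df = pd.DataFrame({'Description': ['apple', 'apple', 'pear'],
                   'Season': ['fall', 'fall', None]})

s = df.set_index('Description')['Season'].dropna()
# s now has 'apple' twice in its index, so the next line raises the
# "Reindexing only valid with uniquely valued Index objects" error:
# df['Season'] = df['Season'].fillna(df['Description'].map(s))

# Dropping duplicate Descriptions first makes the index unique again
s = df.drop_duplicates('Description').set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))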
I have the following code, which had been working as intended until recently.
import pandas as pd
import numpy as np
file1 = "xyz.csv"
df = pd.read_csv(file1)
pd.options.display.float_format = '{:,.2f}'.format
df.loc[~df['Ship To Customer Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Ship To Customer Zip'].str.slice(stop=5)
df.loc[df['Ship To Customer Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Ship To Customer Zip'].str.replace(' |-', '', regex=True)
df['revenue'] = df['revenue'].replace(r'\$|,', '', regex=True).replace(r'\(', '-', regex=True).replace(r'\)', '', regex=True)
df['Customer ID'] = df['Ship To Customer'] + df['ZipCleaned']
The objective of the code is to create a column called 'Customer ID' that concatenates the 'Ship To Customer' and 'ZipCleaned' columns.
Problem: for zip codes where users have entered only four digits, in some cases the last line of code above adds a zero ("0") in front of the cleaned zip, and in other cases it doesn't. I noticed that the code only started adding the leading zero for more recent months in my database (the data goes back several years). I would prefer not to include a leading zero in cases where the zip code field contains only 4 digits.
Below is an example of the dataframe
I found a way to resolve this, but I'm not sure if it's the correct approach. To remove the zero from in front of the zip codes, I've added the following code to convert the field into a string.
df['ZipCleaned'] = df['ZipCleaned'].astype(str)
Max Power, to answer your question about the 4-digit zip codes: when I save my data as CSV, it removes the "0" from those zip codes that begin with "0", thereby converting them to four digits (it did this in some cases and not in others, thereby causing inconsistencies in my data set).
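A minimal alternative sketch, assuming the column name from the code above: forcing the zip column to be read as text keeps whatever the user typed, so a 4-digit entry stays 4 digits and a leading "0" is never dropped or invented.

import pandas as pd

# Read the zip column as strings so pandas never treats it as a number
df = pd.read_csv("xyz.csv", dtype={'Ship To Customer Zip': str})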
I have a CSV dataset which, for whatever reason, has an extra asterisk (*) at the end of some names. I am trying to remove them, but I'm having trouble. I just want to replace the name in cases where it ends with a *, and otherwise keep it as-is.
I have tried a couple variations of the following, but with little success.
import pandas as pd
people = pd.read_csv("people.csv")
people.loc[people["name"].str[-1] == "*"] = people["name"].str[:-1]
Here I am getting the following error:
ValueError: Must have equal len keys and value when setting with an iterable
I understand why this is wrong, but I'm not sure how else to reference the values I want to change.
I could instead do something like:
starred = people.loc[people["name"].str[-1] == "*"]
starred["name"] = starred["name"].str[:-1]
I get a warning here, but this kind of works. The problem is that it only contains the previously starred people, not all of them.
I'm kind of new to this, so apologies if this is simple. I feel like it shouldn't be too hard, there should be some function to do this, but I don't know what it is.
Your syntax for pd.DataFrame.loc needs to include a column label:
df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})
df.loc[df['name'].str[-1] == '*', 'name'] = df['name'].str[:-1]
print(df)
name
0 John
1 Rose
2 Summer
3 Mark
If you only specify the first part of the indexer, you will be filtering by row label only and getting back a dataframe. You cannot assign a series to a dataframe.
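As an aside, a vectorized string method can do the same cleanup without the boolean mask; str.rstrip('*') strips any trailing asterisks and leaves other names untouched:

df['name'] = df['name'].str.rstrip('*')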
I'm trying to use pandas to do some analysis on messaging data and am running into a few problems trying to prep the data. It comes from a database I don't control, so I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
# Select all the messages in the database. Be careful if you pull the whole test database; it may have 5,000,000 messages.
full_set_data = pd.read_sql("Select * from message", con=engine)
After I convert the timestamp and set it as the index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
# Extract the data columns I really care about, since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, but I get a SettingWithCopyWarning.
# Now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires that I know the key values ahead of time.
key_group = groups.get_group((1,1,1))
# for each group in groups:
#     do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding or misusing pandas, as I'm still figuring it out.
I am looking to solve these issues:
1) I can't save to CSV after setting the timestamp (converted to timedelta64) as the index.
2) How do I apply a function to a set of columns without the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without using get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates=['timestamp'], index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql (read_sql is a convenience wrapper that delegates to read_sql_query or read_sql_table).
The SettingWithCopyWarning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
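Applied to the loop from the question, the first option would look like this (same logic, just on an explicit copy):

datacolumns = indexed[['address', 'subaddress', 'rx_or_tx', 'wordcount']
                      + [col for col in indexed.columns if 'DATA' in col]].copy()

for col in datacolumns.columns:
    if 'DATA' in col:
        # operating on a copy, so no SettingWithCopyWarning here
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)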
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
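Alternatively, a groupby object is iterable, so you can visit every group without knowing the keys up front (do_analysis below is a placeholder for whatever analysis you run per group):

for key, group_df in datacolumns.groupby(['address', 'subaddress', 'rx_or_tx']):
    # key is the (address, subaddress, rx_or_tx) tuple, group_df the matching rows
    do_analysis(group_df)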
I'm not sure if it's all exactly what you want but let me know and I can edit.