Excuse my being a total novice. I am writing several columns of data to a CSV file where I would like to maintain the headers every time I run the script to write new data to it.
I have successfully appended data to the CSV every time I run the script, but I cannot get the data to write in a new row. It tries to extend the data on the same row. I need it to have a line break.
df = pd.DataFrame([[date, sales_sum, qty_sum, orders_sum, ship_sum]], columns=['Date', 'Sales', 'Quantity', 'Orders', 'Shipping'])
df.to_csv(r'/profit.csv', header=None, index=None, sep=',', mode='a')
I would like the headers to be on the first row: "Date, Sales, Quantity, Orders, Shipping".
Second row will display the actual values.
When running the script again, I would like the third row to be appended with the next day's values only. When passing headers it seems it wants to write the headers again, then write the data again below it. I prefer only one set of headers at the top of the CSV. Is this possible?
Thanks in advance.
Not sure if I completely understood what you are trying to do, but checking the documentation it seems that there is a header option that can be set to False:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
header : bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be
aliases for the column names.
Changed in version 0.24.0: Previously defaulted to False for Series.
Is this what you are looking for?
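If the goal is one header row at the top of the file with a new data row appended on each run, a common pattern is to write the header only when the file does not exist yet. A minimal sketch, with placeholder values standing in for your date, sales_sum, qty_sum, orders_sum and ship_sum variables and an assumed output path:

import os
import pandas as pd

out_path = 'profit.csv'  # adjust to your own path

# Placeholder values for date, sales_sum, qty_sum, orders_sum, ship_sum
row = [['2020-01-01', 100.0, 5, 3, 9.99]]
df = pd.DataFrame(row, columns=['Date', 'Sales', 'Quantity', 'Orders', 'Shipping'])

# Append the row; write the header only when the file is being created
df.to_csv(out_path, mode='a', index=False, header=not os.path.isfile(out_path))

On the first run this produces the header followed by one data row; each later run appends a single new row.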
You can define the main dataframe with the columns you want.
Then, for each day, create a dataframe of only the new rows and append it to the main dataframe.
Like this:
# Pass columns by keyword; the second positional argument of pd.DataFrame is the index
Main_df = pd.DataFrame(values, columns=columns)
New_rows = pd.DataFrame(new_values, columns=columns)
Main_df = Main_df.append(New_rows, ignore_index=True)
For example:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df)
# A B
#0 1 2
#1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df = df.append(df2, ignore_index=True)
print(df)
# A B
#0 1 2
#1 3 4
#2 5 6
#3 7 8
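Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same example would use pd.concat instead. A minimal sketch:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# pd.concat is the replacement for DataFrame.append on pandas >= 2.0
df = pd.concat([df, df2], ignore_index=True)
print(df)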
I'm trying to filter a dataframe by the first row, but can't seem to figure out how to do it.
Here's a sample version of the data I'm working with:
In [11]: df = pd.DataFrame(
...: [['Open-Ended Response', 'Open-Ended Response', 'Response', 'Response'], [1, 2, 3, 4]],
...: columns=list('ABCD'),
...: )
In [12]: df
Out[12]:
A B C D
0 Open-Ended Response Open-Ended Response Response Response
1 1 2 3 4
What I want to do is filter for all columns that start with "Response" in the first non-header row. So in this case, just keep the last two columns in their own dataframe.
I can easily filter the header with something like this:
respo = [col for col in df if col.startswith('Response')]
But it doesn't seem to work on the 1st non-header row. Importantly, I need to keep the current header after I filter.
Thank you.
First step is to select the values of the first row:
df.iloc[0] # selects the values in the first row
Then, use pandas's .str string accessor methods to work with the data values rather than the column names:
df.iloc[0].str.startswith('Response') # Test the result of the above line
This will give you a Series with True/False values indexed by column name. Finally, use this to select the columns from your dataframe based on the matched labels:
df.loc[:, df.iloc[0].str.startswith('Response')] # Select columns based on the test
This should do the trick!
See pandas's docs on Indexing and Selecting Data and the string accessor methods for more help.
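Putting the pieces together with the sample frame from the question, a short end-to-end sketch:

import pandas as pd

df = pd.DataFrame(
    [['Open-Ended Response', 'Open-Ended Response', 'Response', 'Response'], [1, 2, 3, 4]],
    columns=list('ABCD'),
)

# Boolean mask over columns: True where the first data row starts with 'Response'
mask = df.iloc[0].str.startswith('Response')

# Keep only the matching columns; the original header stays intact
respo = df.loc[:, mask]
print(respo)  # columns C and D only, with both rows preserved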
I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'], 'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'], 'val3': [7, 9, 11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, with values like DC0001, DC0002, etc. Does "i" always have to be numeric? Instead of reshaping, it adds the stubnames as new columns to my dataset and returns zero rows.
This is how my real columns look.
My real data might contain NA's as well. Do I have to fill them with default values for wide_to_long to work?
Can you please help me figure out what the issue is? Any other approach that achieves the same result would also be helpful.
Try adding the additional suffix argument to the function, which allows string suffixes:
pd.wide_to_long(......................., suffix=r'\w+')
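As an illustration of what the suffix argument does (the column names below are made up for the sketch), the default suffix pattern only matches digits, so string suffixes need suffix=r'\w+':

import pandas as pd

# Hypothetical frame whose wide columns end in string suffixes (dateA, dateB)
df = pd.DataFrame({'person_id': [1, 2],
                   'dateA': ['12/31/2007', '11/25/2009'],
                   'dateB': ['12/31/2017', '11/25/2019']})

# Without suffix=r'\w+' the default pattern '\d+' matches nothing and the
# result is an empty frame with the stubnames as columns
long_df = pd.wide_to_long(df, stubnames=['date'], i='person_id', j='grp', suffix=r'\w+')
print(long_df)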
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix for wide_to_long to group by. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re

def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex=regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns=dd, inplace=True)
    return df

tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})

# Change date columns, then val columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
This is quite late to answer this question, but I'm putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})
## You can use m13op22's solution to rename your columns so that the numeric part is at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion (in this example 'hdate', 'tval') as
## stubnames. The mistake was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)
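If you prefer person_id and grp back as ordinary columns afterwards, a small optional follow-up (just a sketch, continuing from the df above):

# Turn the (person_id, grp) MultiIndex back into regular columns
df = df.reset_index()
print(df)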
I am trying to process a CSV file into a new CSV file with only the columns of interest, removing rows with unfit values of -1. Unfortunately I get unexpected results: the old ID in column 0 is automatically included in the new CSV file even though I never ask the script to keep it (it is not listed in cols = [..]).
How could I renumber the ids after rows are removed? For example, when row 9 with id=9 is removed, the dataset ids currently run as [..7,8,10...] instead of being recounted as [..7,8,9,10...]. I hope someone has a solution for this.
import pandas as pd
# take only specific columns from dataset
cols = [1, 5, 6]
data = pd.read_csv('data_sample.csv', usecols=cols, header=None)
data.columns = ["url", "gender", "age"]
# remove rows from dataset with undefined values of -1
data = data[data['gender'] != -1]
data = data[data['age'] != -1]
""" Additional working solution
indexGender = data[data['gender'] == -1].index
indexAge = data[data['age'] == -1].index
# Delete the rows indexes from dataFrame
data.drop(indexGender,inplace=True)
data.drop(indexAge, inplace=True)
"""
data.to_csv('data_test.csv')
Thank you in advance.
I solved the problem with a simple line after the data drop:
data.reset_index(drop=True, inplace=True)
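For context, here is a minimal sketch of the whole flow under the same assumptions as the question (a data_sample.csv with the url, gender and age values in columns 1, 5 and 6):

import pandas as pd

cols = [1, 5, 6]
data = pd.read_csv('data_sample.csv', usecols=cols, header=None)
data.columns = ["url", "gender", "age"]

# Remove rows with undefined values of -1
data = data[(data['gender'] != -1) & (data['age'] != -1)]

# Renumber the remaining rows 0..n-1
data.reset_index(drop=True, inplace=True)

# to_csv writes the (now renumbered) index as the first column by default;
# pass index=False instead if you do not want an id column at all
data.to_csv('data_test.csv')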
I have a CSV file that I process with pandas. I have four columns, as follows:
df.columns = ["id", "ocr", "raw_value", "manual_raw_value"]
However, I have some rows which have more than four columns. For instance:
id          ocr        raw_value      manual_raw_value
2d704f42    OMNIPAGE   remuneration   rémunération       hello
bfa6c9f14   OMNIPAGE   35470          35470
213e1e1e    OMNIPAGE   Echeance       Echéance
I did the following in order not to read the rows with extra columns (like the first row):
df = pd.read_csv(filename, sep=",",index_col=None, error_bad_lines=False)
However, the rows with extra columns are kept.
Thank you
Another try. For easier indexing, I would rename columns, even those which are unnecessary:
df.columns = range(0, df.shape[1])
I assume that the empty places are NaN, so valid rows will have NaN in all the extra columns. I was not successful in finding a specific function for this, so I would iterate through the extra columns one by one, keep only the rows where they are NaN, and then pick only the needed columns:
for i in range(4, df.shape[1]):
    df = df[df.iloc[:, i].isnull()]
df = df[[0, 1, 2, 3]]
Then rename them how you want. Hope this will help.
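As a more compact variant of the same idea (toy data made up for the sketch; the first four columns are assumed to be the ones to keep):

import numpy as np
import pandas as pd

# Toy frame: four real columns plus a spill-over column that is NaN for valid rows
df = pd.DataFrame([['a', 'OMNIPAGE', 'x', 'y', 'extra'],
                   ['b', 'OMNIPAGE', 'u', 'v', np.nan]])

# Keep rows whose extra columns (position 4 onwards) are all NaN,
# then keep only the first four columns
mask = df.iloc[:, 4:].isnull().all(axis=1)
df = df.loc[mask, df.columns[:4]]
print(df)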
How can I add a header to a DF without replacing the current one? In other words I just want to shift the current header down and just add it to the dataframe as another record.
*Secondary question: how do I add tables (an example dataframe) to a stackoverflow question?
I have this (note that the header is really just a data row that got promoted):
0.213231 0.314544
0 -0.952928 -0.624646
1 -1.020950 -0.883333
I need this (all other records are shifted down and a new record is added)
(also: I couldn't read the csv properly because I'm using s3_text_adapter for the import and I couldn't figure out how to have an argument that ignores header similar to pandas read_csv):
          A         B
0  0.213231  0.314544
1 -0.952928 -0.624646
2 -1.020950 -0.883333
Another option is to add it as an additional level of the column index, to make it a MultiIndex:
In [11]: df = pd.DataFrame(randn(2, 2), columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 -0.952928 -0.624646
1 -1.020950 -0.883333
In [13]: df.columns = pd.MultiIndex.from_tuples(zip(['AA', 'BB'], df.columns))
In [14]: df
Out[14]:
AA BB
A B
0 -0.952928 -0.624646
1 -1.020950 -0.883333
This has the benefit of keeping the correct dtypes for the DataFrame, so you can still do fast and correct calculations on your DataFrame, and allows you to access by both the old and new column names.
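For instance, a small sketch of the same setup showing access by both names:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(2, 2), columns=['A', 'B'])
df.columns = pd.MultiIndex.from_tuples(list(zip(['AA', 'BB'], ['A', 'B'])))

print(df['AA'])         # sub-frame holding the old column 'A'
print(df[('AA', 'A')])  # the same data as a Series, addressed by both labels
print(df.sum())         # dtypes are preserved, so aggregation still works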
For completeness, here's DSM's (deleted) answer, making the columns a row, which, as mentioned already, is usually not a good idea:
In [21]: df_bad_idea = df.T.reset_index().T
In [22]: df_bad_idea
Out[22]:
0 1
index A B
0 -0.952928 -0.624646
1 -1.02095 -0.883333
Note, the dtype may change (if these are column names rather than proper values) as in this case... so be careful if you actually plan to do any work on this as it will likely be slower and may even fail:
In [23]: df.sum()
Out[23]:
A -1.973878
B -1.507979
dtype: float64
In [24]: df_bad_idea.sum() # doh!
Out[24]: Series([], dtype: float64)
If the column names are actually a row that was mistaken for a header row, then you should correct this when reading in the data (e.g. read_csv with header=None).
The key is to specify header=None when reading and then assign df.columns to add the header:
import pandas as pd

data = pd.read_csv('file.csv', skiprows=2, header=None)  # skip leading rows if applicable
df = pd.DataFrame(data)      # optional: read_csv already returns a DataFrame
df = df.iloc[:, [0, 1]]      # keep only the first two columns
df.columns = ['A', 'B']      # add the header
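If the goal is to end up with that header in a file, a short follow-up sketch (the output filename here is made up):

# Writing the frame back out now includes the 'A'/'B' header as the first row
df.to_csv('file_with_header.csv', index=False)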