I'm completely new to coding and learning on the go and need some advice.
I have a dataset imported into jupyter notebook from an excel.csv file. The column headers are all dates in the format "1/22/20" (22nd January 2020) and I want them to read as "Day1", "Day2", "Day3" etc. I have changed them manually to read as I want them but the csv file updates with a new column every day, which means that when I read it into my notebook to produce the graphs I want I first have to update the code in my notebook and add the extra "Dayxxx". This isn't a major problem but I have now 92 days in the csv file/dataset and it's getting boring. I wondered if there is a way to automatically add the "Dayxxx" by reading the file and with a for or while loop to change the column headers.
Any advice gratefully recieved, thanks.
Steptho.
I understand that those are your only columns and they are already ordered from first to last day?
You can obtain the number of days by getting the length of the list of column names returned by df.columns. From there you can create a new list with the column names you desire.
import pandas as pd
df = pd.read_csv("your_csv")
no_columns = len(df.columns)
new_column_names = []
for day in range(no_columns):
new_column_names.append("Day "+str(day+1))
df.columns = new_column_names
Related
Lets say I have a CSV which is generated yearly by my business. Each year my business decides there is a new type of data we want to collect. So Year2002.csv looks like this:
Age,Gender,Address
A,B,C
Then year2003.csv adds a new column
Age,Gender,Address,Location,
A,B,C,D
By the time we get to year 2021, my CSV now has 7 columns and looks like this:
Age,Gender,Address,Location,Height,Weight,Race
A,B,C,D,E,F,G,H
My business wants to create a single CSV which contains all of the data recorded. Where data is not available, (for example, Address data is not recorded in the 2002 CSV) there can be a 0 or a NAAN or a empty cell.
What is the best method available to merge the CSV's into a single CSV? It may be worthwhile saying, that I have 15,000 CSV files which need to be merged. ranging from 2002-2021. 2002 the CSV starts off with three columns, but by 2020, the csv has 10 columns. I want to create one 'master' spreadsheet which contains all of the data.
Just a little extra context... I am doing this because I will then be using Python to replace the empty values using the new data. E.g. calculate an average and replace CSV empty values with that average.
Hope this makes sense. I am just looking for some direction on how to best approach this. I have been playing around with excel, power bi and python but I can not figure out the best way to do this.
With pandas you can use pandas.read_csv() to create Dataframe, which you can merge using pandas.concat().
import pandas as pd
data1 = pd.read_csv(csv1)
data2 = pd.read_csv(csv2)
data = pd.concat(data1, data2)
You should take a look at python csv module.
A good place to start: https://www.geeksforgeeks.org/working-csv-files-python/
It is simple and useful for reading CSVs and creating new ones.
I imported a .csv file with a single column of data into a dataframe that I am trying to clean up by splitting the column based on various string occurrences within the cells. I've tried numerous means to split the column, but can't seem to get it to work. My latest attempt was using the following:
df.loc[:,'DataCol'] = df.DataCol.str.split(pat=':\n',expand=True)
df
The result is a dataframe that is still one column and completely unchanged. What am I doing wrong? This is my first time doing anything like this so please forgive the simple question.
Df.loc creates a copy of the column you've selected - try replacing the code below with df['DataCol'], which references the actual column in the original dataframe.
df.loc[:,'DataCol']
I have a table (Tab delimited .txt file) in the following form:
each row is an entry;
first row are headers
the first 5 columns are simple numeric parameters
all column after the 7th column are supposed to be a list of values
My problem is how can I import and create a data frame where the last column contain a list of values?
-----Problem 1 ----
The header (first row) is "shorter", containing simply the name of some columns. All the values after the 7th do not have a header (because it is suppose to be a list). If I import the file as is, this appear to confuse the import functions
If, for example, I import as follow
df = pd.read_table( path , sep="\t")
the DataFrame created has only as many columns as the elements in the first row. Moreover, the data value assigned are mismatched.
---- Problem 2 -----
What is really confusing to me is that if I open the .txt in Excel and save it as Tab-delimited (without changing anything), I can then import it without problems, with headers too: columns with no header are simply given an “Unnamed XYZ” tag.
Why would saving in Excel change it? Using Note++ I can see only one difference: the original .txt is in "Unix (LF)" form, while the one saved in Excel is "Windows (CR LF)". Both are UTF-8, so I do not understand how this would be an issue?!?
Nevertheless, from here I could manipulate the data and try to gather all columns I wish and make them into a list. However, I hope that there is a more elegant and faster way to do it.
Here is a screen-shot of the .txt file
Thank you,
I don't know whether this is a very simple qustion, but I would like to do a condition statement based on two other columns.
I have two columns like: the age and the SES and the another empty column which should be based on these two columns. For example when one person is 65 years old and its corresponding socio-economic status is high, then in the third column(empty column=vitality class) a value of 1 is for example given. I have got an idea about what I want to achieve, however I have no idea how to implement that in python itself. I know I should use a for loop and I know how to write conditons, however due to the fact that I want to take two columns into consideration for determining what will be written in the empty column, I have no idea how to write that in a function
and furthermore how to write back into the same csv (in the respective empty column)
[]
Use the pandas module to import the csv as a DataFrame object. Then you can do logical statements to fill empty columns:
import pandas as pd
df = pd.read_csv('path_to_file.csv')
df.loc[(df['age']==65) & (df['SES']=='high'), 'vitality_class'] = 1
df.to_csv('path_to_new_file.csv', index=False)
I am fairly new to Python and Pandas and I have not found an answer to this question while searching.
I have multiple csv data files that all contain a date-time column and corresponding data. I wanted to create a series/dataframe that contains a specific span of dates (all data is 1 min interval, so if I wanted to look at July for example I would set the index to start at July and go until the end).
Can I create a series or dataframe that contains only the date-time intervals as an index and does not contain column info? Or would I create an index (the row numbers) and then fill my column with the dates.
I also am unsure of using 'pd.merge' vs 'newdataframe = pd.merge'. When using just pd.merge, nothing comes up in my variable explorer (I use Anaconda's Spyder IDE), only when I use newdataframe = pd.merge does it appear.
Thanks in advance,