I have a column in a pandas DataFrame (df) called "9-7". When I use df.to_csv('df.csv') to save the DataFrame, the column title is changed to 7-Sep, i.e. it is read as the 7th of September. However, I need the title to remain "9-7".
This has to do with Excel's interpretation of the data, not with pandas: a CSV file is nothing more than a table in plain-text format, and Excel guesses how to display each value when it opens one.
If you are going to continue working in Excel, you could use the to_excel() function instead:
import pandas as pd
pd.DataFrame({'9-7': [1]}).to_csv('test.csv', index=False)     # Excel will display this header as 7-Sep
pd.DataFrame({'9-7': [1]}).to_excel('test.xlsx', index=False)  # Excel keeps this header as the text "9-7"
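To convince yourself that the CSV itself is intact, here is a minimal check (a sketch reusing the test.csv written above): read the file back as plain text or with pandas and the header is still "9-7".

import pandas as pd

pd.DataFrame({'9-7': [1]}).to_csv('test.csv', index=False)

# The raw file still contains the literal header "9-7";
# it is only Excel's display layer that shows it as a date.
with open('test.csv') as f:
    print(f.read())                                  # 9-7, then 1

print(pd.read_csv('test.csv').columns.tolist())      # ['9-7']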
I am trying to create a list from a CSV. This CSV contains a 2-dimensional table [540 rows and 8 columns], and I would like to create a list that contains the values of a specific column, column 4 to be specific.
I tried list(df.columns.values)[4]; it does return the name of the column, but I'm trying to get the values from the rows in column 4 and turn them into a list.
import pandas as pd
import urllib

# This is the empty list
company_name = []

# Load the CSV file
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')

# Trying to extract the list of all company names from the "Name of Stock" column
companies_column = list(df.columns.values)[4]   # this only returns the name of the column
companies_column = list(df.iloc[:, 4].values)   # this overwrites the name with the raw values
So for this, keeping companies_column as the column name (from your first attempt), you can just add the following line after the code you've posted:
company_name = df[companies_column].tolist()
This gets the column data in the companies column as a pandas Series (essentially, a Series is just a fancy list) and then converts it to a regular Python list.
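For example, assuming the CSV loaded as above, a quick sanity check of the result might look like this:

companies_column = df.columns[4]            # name of the fifth column
company_name = df[companies_column].tolist()

print(type(df[companies_column]))           # <class 'pandas.core.series.Series'>
print(type(company_name))                   # <class 'list'>
print(company_name[:5])                     # first five values from that column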
Or, if you were to start from scratch, you can also just use these few lines:
import pandas as pd
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
company_name = df[df.columns[4]].tolist()
Another option: if this is the only thing you need to do with your csv file, you can also get away with just using the csv module that comes with Python instead of installing pandas, roughly as sketched below.
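A rough sketch of that csv-only approach (assuming the same file, with the header in the first row and the wanted values in the fifth column) might look like this:

import csv

company_name = []
with open('Downloads/Dropped_Companies.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                        # skip the header row
    for row in reader:
        company_name.append(row[4])     # value from the fifth column (index 4)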
If you want to learn more about how to get data out of your pandas DataFrame (the df variable in your code), you might find this blog post helpful.
I think you can try this to get all the values of a specific column:
companies_column = df["column name"]
Replace "column name" with the name of the column whose values you want (keep the quotes).
I am working on a DataFrame loaded from a CSV. I tried changing the data types in the CSV file itself and saving it, but for some reason it won't let me save, so when I load it into pandas the date and time columns appear as object.
I have tried a few ways to transform them to datetime, but without much success:
1) df['COLUMN'] = pd.to_datetime(df['COLUMN'].str.strip(), format='%m/%d/%Y')
gives me the error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
2) Defining dtypes up front and passing them to the read_csv command also gave me an error, since dtype does not accept datetime, only string/int types.
Some of the columns should end up as dates, such as 2019/1/1, and some as times, such as 20:00:00.
Do you know of an effective way to transform those object columns to either date or time?
Based on the discussion, I downloaded the data set from the link you provided and read it with pandas. I took one column that contains dates, sliced off a part of it, and converted it with pandas' datetime handling, as you did. With that, the script you mentioned works.
#import the necessary libraries
import numpy as np
import pandas as pd

#load the csv into a DataFrame
data = pd.read_csv("NYPD_Complaint_Data_Historic.csv")

#take one column which contains dates as an example
dte = data['CMPLNT_FR_DT']

# =============================================================================
# Take a small slice of dte, which contains the date strings,
# and convert it to datetime
# =============================================================================
test_data = dte[0:10]
df1 = pd.DataFrame(test_data)
df1['new_col'] = pd.to_datetime(df1['CMPLNT_FR_DT'])
df1['year'] = df1['new_col'].dt.year
df1['month'] = df1['new_col'].dt.month
df1['day'] = df1['new_col'].dt.day

#The way you used to convert the data also works
df1['COLUMN'] = pd.to_datetime(df1['CMPLNT_FR_DT'].str.strip(), format='%m/%d/%Y')
The problem might be in the way you get the data. Since the result is stored in a DataFrame, it won't be a problem to save it in any format. Please let me know if I understood correctly and whether this helped.
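For the time-only columns mentioned in the question (values such as 20:00:00), a similar approach should work. This is only a hedged sketch: CMPLNT_FR_TM is an assumed column name, so substitute whatever your time column is actually called.

tme = data['CMPLNT_FR_TM'][0:10]             # assumed column name, adjust as needed
df1['time_col'] = pd.to_datetime(tme, format='%H:%M:%S', errors='coerce')
df1['hour'] = df1['time_col'].dt.hour
df1['time_only'] = df1['time_col'].dt.time   # datetime.time objects, no date part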
I am a complete noob at this Python and Jupyter Notebook stuff. I am taking an Intro to Python course and have been assigned a task: extracting information from a .csv file. The following is a snapshot of my .csv file, titled "feeds1.csv":
https://i.imgur.com/BlknyC3.png
I can import the .csv into Jupyter Notebook, and I have tried the groupby function to group it, but that won't work because the created_at column also contains the time of day.
import pandas as pd
df = pd.read_csv("feeds1.csv")
I need it to output as follows:
https://i.imgur.com/BDfnZrZ.png
The ultimate goal would be to create a csv file with this accumulated data and use it to plot a chart.
If you do not need the time of day but just the date, you can simply use this:
df.created_at = df.created_at.str.split(' ').str[0]
dfout = df.groupby(['created_at']).count()
dfout.reset_index(level=0, inplace=True)
finaldf = dfout[['created_at', 'entry_id']]
finaldf.columns = ['Date', 'field2']
finaldf.to_csv('outputfile.csv', index=False)
The first line will split the created_at column at the space between the date and time. The .str[0] means it will only keep the first part of the split (which is the date).
The second line groups them by date and gives you the count.
When writing to csv, if you do not want the index to show (as in your pic), then use index=False. If you want the index, then just leave that portion out.
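Since the ultimate goal was to plot a chart from the accumulated data, a minimal follow-up sketch (assuming matplotlib is installed and reusing finaldf from above) could be:

import matplotlib.pyplot as plt

finaldf.plot(x='Date', y='field2', kind='bar', legend=False)
plt.ylabel('entries per day')
plt.tight_layout()
plt.show()                      # or plt.savefig('chart.png') to save it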
First you need to parse your date right:
df["date_string"] = df["created_at"].str.split(" ").str[0]
df["date_time"] = pd.to_datetime(df["date_string"])
# You can choose to drop the earlier columns here
# Now group by the date and apply the aggregation/function you want,
# e.g. summing (or counting) field2 per day:
df = df.groupby("date_time")["field2"].sum().reset_index()
df.to_csv("abc.csv", index=False)
I'm trying to add a new column to a pandas DataFrame. Also, I am trying to give a name to the index so that it is printed out in Excel when I export the data.
import pandas as pd
import csv

# read the csv file
file = 'RALS-04.csv'
df = pd.read_csv(file)

# select the columns that I want
column1 = df.iloc[:, 0]
column2 = df.iloc[:, 2]
column3 = df.iloc[:, 3]

column1.index.name = "items"
column2.index.name = "march2012"
column3.index.name = "march2011"

df = pd.concat([column1, column2, column3], axis=1)

# create a new column with 'RALS' as a default value
df['comps'] = 'RALS'

# write them back to a new CSV file
with open('test.csv', 'a') as f:
    df.to_csv(f, index=False, header=True)
The problem with the output is that the 'RALS' value I added to the DataFrame goes down to row 2000, while the data stops at row 15. How can I constrain 'RALS' so that it doesn't go beyond the length of the data being exported? I would also prefer a more elegant, automated way rather than specifying the row at which the default value should stop.
The second question is that the labels I assigned to the columns using index.name do not appear in the output; instead they are replaced by a 0 and a 1. Please advise.
Thanks so much for your input.
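A hedged sketch of one way to tackle both points, assuming the stray rows come from blank lines read in from the source CSV and that renaming the DataFrame's columns (rather than each Series' index) is what should appear as headers:

import pandas as pd

df = pd.read_csv('RALS-04.csv')

# Keep only the columns of interest and rename them; column names
# (not index names) are what show up in the header row of the exported CSV.
df = df.iloc[:, [0, 2, 3]]
df.columns = ['items', 'march2012', 'march2011']

# Drop rows that are entirely empty so the new column does not run past
# the real data (assuming the extra rows were read in as blanks).
df = df.dropna(how='all')

df['comps'] = 'RALS'

# Write in 'w' mode (the default) so nothing is appended to an old test.csv.
df.to_csv('test.csv', index=False)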
Here's my problem: I have an Excel sheet with 2 columns (see below).
I'd like to print (in the Python console or in an Excel cell) all the data in this form:
"1" : ["1123","1165", "1143", "1091", "n"], *** n ∈ [A2; A205]***
We don't really care about column B, but I need every postal code to appear in this specific form.
Is there a way to do it with Excel, or in Python with pandas? (If you have any other ideas, I would love to hear them.)
Cheers
I think you can read only the first column with usecols (called parse_cols in older pandas versions) and then drop all rows from 205 to 1000 with skiprows in read_excel:
df = pd.read_excel('test.xls',
                   sheet_name='Sheet1',
                   usecols=[0],                       # parse_cols=0 in older pandas
                   skiprows=list(range(205, 1000)))   # drop the rows you do not need
print(df)
Last, use tolist to convert the first column to a list:
print({"1": df.iloc[:,0].tolist()})
The simplest solution is to parse only the first column and then use iloc:
df = pd.read_excel('test.xls',
                   usecols=[0])    # parse_cols=0 in older pandas
print({"1": df.iloc[:206, 0].astype(str).tolist()})
I am not familiar with Excel, but pandas can easily handle this problem.
First, read the Excel file into a DataFrame:
import pandas as pd
df = pd.read_excel(filename)
Then print it as you like:
print({"1": list(df.iloc[0:N]['A'])})
where N is the number of rows you would like to print. That is it. If the values are not already strings, you need to cast the ints to strings.
Also, there are a lot of parameters that control how read_excel loads the file; you can go through the documentation to pick suitable ones.
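For instance, usecols and nrows can narrow the load to just the range you need. A hedged sketch, assuming the postal codes sit in column A of the first sheet with a header in row 1 (the file name here is hypothetical):

import pandas as pd

filename = "postal_codes.xlsx"   # hypothetical path; use your own file

# Keep only Excel column A and stop after 204 data rows (A2 through A205
# when row 1 holds the header); adjust to your sheet.
df = pd.read_excel(filename, usecols="A", nrows=204)
print({"1": df.iloc[:, 0].astype(str).tolist()})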
Hope this is helpful to you.