I am a complete noob at this Python and Jupyter Notebook stuff. I am taking an Intro to Python course and have been assigned a task: to extract information from a .csv file. The following is a snapshot of my .csv file, titled "feeds1.csv":
https://i.imgur.com/BlknyC3.png
I can import the .csv into Jupyter Notebook and have tried the groupby function to sort it, but it won't work because the column also has the time in it.
import pandas as pd
df = pd.read_csv("feeds1.csv")
I need it to output as follows:
https://i.imgur.com/BDfnZrZ.png
The ultimate goal would be to create a csv file with this accumulated data and use it to plot a chart.
If you do not need the time of day but just the date, you can simply use this:
df.created_at = df.created_at.str.split(' ').str[0]
dfout = df.groupby(['created_at']).count()
dfout.reset_index(level=0, inplace=True)
finaldf = dfout[['created_at', 'entry_id']]
finaldf.columns = ['Date', 'field2']
finaldf.to_csv('outputfile.csv', index=False)
The first line will split the created_at column at the space between the date and time. The .str[0] means it will only keep the first part of the split (which is the date).
The second line groups them by date and gives you the count.
When writing to csv, if you do not want the index to show (as in your pic), then use index=False. If you want the index, then just leave that portion out.
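An equivalent approach, sketched here under the assumption that created_at parses cleanly as a datetime, avoids the string split by letting pandas parse the column:

import pandas as pd

df = pd.read_csv("feeds1.csv")
# Parse once, then keep only the calendar date
df["created_at"] = pd.to_datetime(df["created_at"]).dt.date
finaldf = df.groupby("created_at")["entry_id"].count().reset_index()
finaldf.columns = ["Date", "field2"]
finaldf.to_csv("outputfile.csv", index=False)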
First you need to parse your date right:
df["date_string"] = df["created_at"].str.split(" ").str[0]
df["date_time"] = pd.to_datetime(df["date_string"])
# You can choose to drop the earlier columns
# Now just group by the date and apply the aggregation/function you want
df = df.groupby("date_time")["field2"].sum().reset_index()  # for example
df.to_csv("abc.csv", index=False)
I have this code that gets all the data from the CDC. However, what I want is to get the data starting at a specific date, for example all data after 04/03/2022. Is it possible to do that?
# Source: https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-County/8xkx-amqh/
import io
import requests
import pandas as pd

urlData = requests.get('https://data.cdc.gov/api/views/8xkx-amqh/rows.csv?accessType=DOWNLOAD').content
# Convert to pandas DataFrame
vcounty_df = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
The server seems to serve a CSV file, so it will be difficult to only download part of the data that you want. You can try to filter every line on the fly, but the entire file will still be transferred across the Internet.
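If you do want to filter on the fly, a chunked read keeps memory bounded even though the whole file still crosses the network; a sketch (the chunk size is arbitrary):

import pandas as pd

url = 'https://data.cdc.gov/api/views/8xkx-amqh/rows.csv?accessType=DOWNLOAD'
chunks = pd.read_csv(url, parse_dates=["Date"], dtype={"FIPS": "str"}, chunksize=100_000)
# Keep only rows on or after 2022-04-03 from each chunk
parts = [c[c["Date"] >= "2022-04-03"] for c in chunks]
df = pd.concat(parts, ignore_index=True)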
A more practical approach is to post-process the data by filtering for the date range that you want. Here is how to do it.
# Create a DataFrame with the "Date" column as an index,
# since we will be filtering on this index. The
# dtype={"FIPS": "str"} is to suppress the mixed-dtype warning
# on that FIPS column.
df = pd.read_csv(
io.StringIO(urlData.decode("utf-8")),
parse_dates=["Date"],
dtype={"FIPS": "str"},
index_col=0,
)
# Filter for rows from 2022-04-03 onward.
# Taken from https://stackoverflow.com/questions/22898824/filtering-pandas-dataframes-on-dates
start_from_april = df.loc["2022-04-03":]
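If you also need an upper bound, the same label slice takes an end date (both ends inclusive); a small sketch, assuming the index is sorted:

# Sort the DatetimeIndex first so label slicing is well defined
df = df.sort_index()
april_only = df.loc["2022-04-03":"2022-04-30"]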
I have a dataframe that has a column containing dates and cities; the dates are repeated every time a new city appears.
I want to keep only 4 specific dates and delete the others. I found an expression that does this, but it handles only one date at a time.
I want to create a function that does this whole process and leaves only the dates I want. Below is the code that eliminates one date at a time:
df[df.column != '2020-06-19']
You can do it like this.
df = df[df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]
Or, if you want to remove these dates:
df = df[~df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]
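Since you asked for a function that does the whole process, here is a minimal sketch wrapping the isin filter; the name keep_dates and its arguments are just illustrative:

def keep_dates(df, date_col, dates):
    # Return only the rows whose date_col value is in dates
    return df[df[date_col].isin(dates)]

filtered = keep_dates(df, 'column', ['2020-06-19', '2020-06-20', '2020-06-21'])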
What I want to do is select a column and copy its values just below it, under the same column. I know I can use a pandas DataFrame to select the column by name, but I don't know if it's better to use openpyxl instead. There are many similar questions about this, but none of them answers my question. Here is my code, where I try to use DataFrames and numpy:
import os
import numpy as np
import pandas as pd

for file in files:
    fileName = os.path.splitext(file)[0]
    if fileName == 'fileNameA':
        try:
            df = pd.read_excel(file)
            list_dates = ['the string of the date i need' for dates in df['Date']]
            # What happens here is that for every date
            # it generates a list with dates
            print(list_dates)
            new_df = df.loc[np.repeat(df['Dates'], len(list_dates))]
            writer = pd.ExcelWriter('fileNameA1.xlsx', engine='xlsxwriter')
            new_df.to_excel(writer, 'Sheet 1')
            writer.save()
        except Exception as e:
            print(e)
#Input data:
Date
01/12/2018
02/12/2018
03/12/2018
04/12/2018
#Output I want:
Date
01/12/2018
02/12/2018
03/12/2018
04/12/2018
01/12/2018
02/12/2018
03/12/2018
04/12/2018
Which is the best alternative: working directly with openpyxl, or using pandas and then a writer to generate the xlsx?
In this question they use df_try or concat(), but how do I know the number of times I should repeat it?
Just use NewDF = pd.concat([df, df])
This will duplicate all rows of df.
If you're trying to duplicate your rows three times or some other odd interval, you could just mash together a temporary df to get the desired results (for adding two copies of df, use the following):
tempdf = pd.concat([df, df])
NewDF = pd.concat([df, tempdf])
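More generally, if you know you need n copies, you can skip the temporary frame; a sketch (n is whatever repeat count you need):

n = 3  # hypothetical repeat count
NewDF = pd.concat([df] * n, ignore_index=True)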
"Best" is usually too subjective to be any good, and it is for this reason that questions asking for library recommendations will be closed.
If you're not doing any real manipulation of the data for statistical purposes, etc. then you probably don't need Pandas. Sticking with a single library can mean your code is easier to understand and maintain.
One approach in openpyxl would be to simply append() the dates at the end of the current worksheet. Something like this (the code will probably need some changes):
# Snapshot first, so we don't iterate over rows we are appending
for row in list(ws.iter_rows(min_col=1, max_col=1, values_only=True)):
    ws.append(row)
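For context, a minimal end-to-end sketch with hypothetical filenames, showing where that loop would sit:

from openpyxl import load_workbook

wb = load_workbook('fileNameA.xlsx')  # hypothetical input file
ws = wb.active
# Snapshot the existing first-column rows (skipping the header),
# then append them below the current data
for row in list(ws.iter_rows(min_row=2, min_col=1, max_col=1, values_only=True)):
    ws.append(row)
wb.save('fileNameA1.xlsx')  # hypothetical output file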
I have a CSV file, and I want to delete some of its rows based on the values in one of the columns. I do not know the code to delete specific rows of a CSV file once it has been read in as a pandas.core.frame.DataFrame.
I read related questions, and I found that people suggest writing every acceptable line to a new file. I do not want to do that. What I want is:
1) to delete the rows whose index (row number) I know
or
2) to make a new CSV in Python's memory (without writing it out and reading it back in)
Here's an example of what you can do with pandas. If you need more detail, you might find Indexing and Selecting Data a helpful resource.
import pandas as pd
from io import StringIO
mystr = StringIO("""speed,time,date
12,22:05,1
15,22:10,1
13,22:15,1""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr)
# convert time column to timedelta, assuming mm:ss
df['time'] = pd.to_timedelta('00:'+df['time'])
# filter for >= 22:10, i.e. second item
df = df[df['time'] >= df['time'].loc[1]]
print(df)
speed time date
1 15 00:22:10 1
2 13 00:22:15 1
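If you instead know the row numbers you want to remove (your option 1), DataFrame.drop works on the index labels; a small sketch, assuming the default RangeIndex from read_csv:

# On a freshly read frame, drop rows by index label; with the
# default RangeIndex the labels match the row numbers
df = pd.read_csv('file.csv')
df = df.drop([0, 2])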
Python n00b here. I'm working with event data in csv files. I am writing a script that changes the order of the columns and sorts by time. That part of the script works, but I want to filter out certain rows based on the value of one column:
Description Date Start End Location Organization
Meeting 2/14/14 9:00 9:30 Conference Room Org1
Meeting 2/14/14 9:30 10:00 Conference Room Org2
If I don't want Org1, how do I filter out the rows for that group's meetings?
I am using pandas:
import pandas as pd
df = pd.read_csv('day_of_the_week.csv')
df = df.sort_values('MEETING START TIME')
#saved_column = df.column_name #you can also use df['column_name']
location = df.LOCATION
date = df.DATE
starttime = df['MEETING START TIME']
endtime = df['MEETING END TIME']
description = df.DESCRIPTION
organization = df.ORGANIZATION
#write new csv file with new order of columns
df.to_csv('Full_List_sorted.csv', columns=["DATE","MEETING START TIME","MEETING END TIME","DESCRIPTION","ORGANIZATION","LOCATION"], index=False)
Thanks
To filter out those rows from df, do the following:
df = df[df["Organization"]!="Org1"]
Also, if it helps (I also started using Pandas just this week), there's a very quick and nice tutorial here:
http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/
(that's not me!)
Read it all. Then create a new DataFrame using pandas filtering. Finally, store the new frame.
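A minimal sketch of that read, filter, store pipeline, with a hypothetical output filename:

import pandas as pd

df = pd.read_csv('day_of_the_week.csv')           # read it all
new_df = df[df['ORGANIZATION'] != 'Org1']         # filter into a new frame
new_df.to_csv('file_filtered.csv', index=False)   # store the new frame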