How to extract specific data from a CSV file, like "Name" and "Address"? - python

[Image: scanned identity document. The recoverable field labels are Name, Father Name, Gender (M), Country of (Oman), Identity Number, Date of Birth (28.10.1995), Date of Issue, and Date of Expiry.]

To extract a specific column from a CSV file, read the file with pandas and then select the column by position with iloc.
import pandas as pd

dataset = pd.read_csv("path_of_csv")
# Once the CSV file is read, slice along the columns
# to get the desired column (example: Name, the 1st column)
Name = dataset.iloc[:, 0]
Alternatively, select the column by its label, which does not depend on the column's position
(this definitely works for pandas version 1.3.5):
dataset = pd.read_csv("path_of_csv")
Name = dataset['Name']
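If only a few named fields are needed, one option is to ask read_csv for just those columns. The following is a minimal sketch, assuming the file has a header row containing "Name" and "Address"; the sample rows are invented for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file on disk.
csv_text = """Name,Address,Gender
Ali,Muscat,M
Sara,Salalah,F
"""

# usecols restricts parsing to the listed column labels.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Address"])
names = subset["Name"].tolist()
print(subset)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.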

Related

Append a variable to each column in a dataframe object

I am trying to solve a trading problem, but I have rephrased it in a different way.
I have a list of countries:
countries = {'country_name': ['France','Germany','Italy','Japan']}
For each country, I have a CSV stored on my laptop. Each CSV has 3 columns [Date, Birth, Death].
I loop over the list, read each CSV, and create a DataFrame object.
countries = {'country_name': ['France','Germany','Italy','Japan']}
countries = pd.DataFrame(countries)
for country in countries['country_name']:
    country_file_name = country + '.csv'
    vars()[country] = pd.read_csv(country_file_name)
    ## Here I want to append country to each column except index
When I run France.head(), I get this output for France:
index Birth Deaths
2020-01-01 9 10
2002-01-02 5 12
...
2002-12-10 14 10
But I want this output for France:
index France_Birth France_Deaths
2020-01-01 9 10
2002-01-02 5 12
....
2002-12-10 14 10
Note: I do not want to do France.columns = ['France_Birth','France_Deaths'] because it would take me days to do it for all the CSVs.
I am using a Jupyter notebook here.
https://colab.research.google.com/drive/1aOg3eOhsigbewAhRwQE1QsxGDKzEPyW5?usp=sharing
I am not sure whether there is a way to do this or whether I have to change my approach.
This can be achieved using the rename method of pandas.DataFrame:
countries = {'country_name': ['France','Germany','Italy','Japan']}
countries = pd.DataFrame(countries)
for country in countries['country_name']:
    country_file_name = country + '.csv'
    vars()[country] = pd.read_csv(country_file_name).rename(columns=lambda s: country + "_" + s)
Note that rename(columns=...) changes every column label, so the date column keeps its name only if it is the index (for example, by passing index_col=0 to read_csv).
See the pandas documentation for DataFrame.rename.
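As an alternative to creating variables through vars(), the frames can be kept in a dictionary keyed by country. This is only a sketch under stated assumptions: the CSV text below is invented and held in memory in place of the real files, and add_prefix is swapped in for the rename call as an equivalent way to prefix every column label.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for France.csv and Germany.csv.
csv_files = {
    "France": "Date,Birth,Death\n2020-01-01,9,10\n2020-01-02,5,12\n",
    "Germany": "Date,Birth,Death\n2020-01-01,7,8\n2020-01-02,6,4\n",
}

frames = {}
for country, text in csv_files.items():
    # index_col=0 makes Date the index, so it is not renamed below.
    df = pd.read_csv(io.StringIO(text), index_col=0)
    # add_prefix prepends the string to every column label.
    frames[country] = df.add_prefix(country + "_")

print(frames["France"].head())
```

Each frame is then available as frames["France"], frames["Germany"], and so on, without writing into the notebook's namespace.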

How can I group my CSV's list of dates into their Months?

I have a CSV file that contains two columns: the first is a date column in the format 01/01/2020, and the second is a number giving that month's sales volume. The dates range from 2004 to 2019. My task is to create a bar chart with 12 bars, each representing the average sales volume for that month across all years. I attempted to use a groupby function but got an error about not having numeric types to aggregate. I am very new to Python, so apologies for the beginner question. I have posted my code so far below. Thanks in advance for any help with this :)
# -*- coding: utf-8 -*-
import csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
file = "GlasgowSalesVolume.csv"
data = pd.read_csv(file)
typemean = (data.groupby(['Date', 'SalesVolume'], as_index=False).mean()
            .groupby('Date')['SalesVolume'].mean())
Output:
DataError: No numeric types to aggregate
I prepared a DataFrame limited to just 2 years and 3 months:
Date Sales
0 01/01/2019 3
1 01/02/2019 4
2 01/03/2019 8
3 01/01/2020 10
4 01/02/2020 20
5 01/03/2020 30
For now the Date column is of string type, so the first step is to
convert it to datetime64:
df.Date = pd.to_datetime(df.Date, dayfirst=True)
Now to compute your result, run:
result = df.groupby(df.Date.dt.month).Sales.mean()
The result is a Series containing:
Date
1 6.5
2 12.0
3 19.0
Name: Sales, dtype: float64
The index is the month number (1 through 12) and the value is the mean for the
respective month, across all years.
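Put together, the steps above can be sketched as a self-contained script. The data is the same six-row sample; the explicit format string and the month-name relabelling via the standard calendar module are additions for illustration, not part of the original answer.

```python
import calendar

import pandas as pd

# Same sample data as above: 2 years x 3 months.
df = pd.DataFrame({
    "Date": ["01/01/2019", "01/02/2019", "01/03/2019",
             "01/01/2020", "01/02/2020", "01/03/2020"],
    "Sales": [3, 4, 8, 10, 20, 30],
})

# Parse day-first date strings into datetime64 values.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Average sales per calendar month, across all years.
monthly_mean = df.groupby(df["Date"].dt.month)["Sales"].mean()

# Replace month numbers with abbreviated month names for nicer labels.
monthly_mean.index = [calendar.month_abbr[m] for m in monthly_mean.index]
print(monthly_mean)
```

With matplotlib installed, calling monthly_mean.plot.bar() on the result should draw one bar per month, which is the chart the question asks for.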

How to Access the data in XLSX sheet, where some fields are referred to another sheets?

The value 18F-AV-1451-A07 refers to another sheet called "CONTENT", column "B", row "3".
I have loaded the dataframe using this code:
pd.read_excel('data/A07.xls',sheet_name = 'DM',skiprows = 12, skipfooter = 2)
I'm getting a null value in the "Conversion Definition" column instead of "18F-AV-1451-A07".
How can I get that data into my dataframe without hardcoding it?
First, credit where it is due: I didn't actually solve this myself, I got help from user U9-Forward. To do this you need:
import pandas as pd
xlsx = pd.ExcelFile('Sample.xlsx')
df1 = pd.read_excel(xlsx, 'CONTENT', header=None)
df2 = pd.read_excel(xlsx, 'Sheet2')
# True for rows of Sheet2 whose Class value appears in the CONTENT sheet
boolean = df2['Class'].isin(df1[0].fillna(df1[1]).dropna())
idxs = boolean.index[boolean]
# Print the rows from the first match through the second match
print(df2.iloc[idxs[0]:idxs[1]+1])
Which gives you
Day Month Class
1 tuesday Feb CM
2 Wednesday Mar NaN
3 Thursday Apr NaN
4 Friday May NaN
5 Saturday Jun NaN
6 Sunday Jul DM
Which I think is what you are looking for.
Note: You may need to convert the file to xlsx. pandas can read ODS files only from version 0.25 onward, and only with the odfpy package installed (engine="odf").

Change unwanted datetime formatted data values into numbers with dashes in Python

I have data that has been changed by some Excel formatting issues. When a value is a number containing a dash (-), Excel automatically converts it into a date format.
For example, 1-1 changed into 01-Jan and 25-2 changed into 25-Feb in Excel.
Values without dashes, such as 1A and 1001, are intact. When I load the data into Spyder, the format changes again, this time into a datetime type.
First the data looks like this in Excel:
Name ID Value
Hello 1A 22
Hi 01-Jan 20
What 02-Jan 12
Is 1001 10
Up 25-Mar 11
When loaded in Python with the code below, the data comes up as a pandas DataFrame with the current year (2019) added to the dates:
import pandas as pd
FAC_sheet = pd.read_excel('data', dtype=str)
Name ID Value
Hello 1A 22
Hi 2019-01-01 00:00:00 20
What 2019-01-02 00:00:00 12
Is 1001 10
Up 2019-03-25 00:00:00 11
Is there a way I can change only the strangely date-formatted values and keep the rest intact? The desired output is:
Name ID Value
Hello 1A 22
Hi 1-1 20
What 1-2 12
Is 1001 10
Up 3-25 11
You can try the following to override the automatic date conversion in pandas (replace Date with the column name):
pandas.read_excel(xlsx, sheet, converters={'Date': str})
From the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html):
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
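If the values still arrive as datetime objects, another option is to map them back to dashed strings after loading. The sketch below is an assumption-laden illustration rather than the answer's method: the helper name restore_dashed_id is hypothetical, the DataFrame is built by hand to mimic what read_excel returned in the question, and the month-day order follows the desired output shown there.

```python
import datetime as dt

import pandas as pd


def restore_dashed_id(value):
    """Turn datetime values back into 'month-day' text; leave others as strings."""
    if isinstance(value, (dt.datetime, dt.date)):
        return f"{value.month}-{value.day}"
    return str(value)


# Hand-built stand-in for the frame read from Excel in the question.
df = pd.DataFrame({
    "Name": ["Hello", "Hi", "What", "Is", "Up"],
    "ID": ["1A", dt.datetime(2019, 1, 1), dt.datetime(2019, 1, 2),
           1001, dt.datetime(2019, 3, 25)],
    "Value": [22, 20, 12, 10, 11],
})

# Apply the helper element-wise to the affected column only.
df["ID"] = df["ID"].map(restore_dashed_id)
restored_ids = df["ID"].tolist()
print(df)
```

Because only datetime values are rewritten, entries such as 1A and 1001 pass through unchanged apart from being converted to strings.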

Create empty csv file with pandas

I am iterating through a number of CSV files and want to append the mean temperatures to a blank CSV file. How do you create an empty CSV file with pandas?
for EachMonth in MonthsInAnalysis:
    TheCurrentMonth = pd.read_csv('MonthlyDataSplit/Day/Day%s.csv' % EachMonth)
    MeanDailyTemperaturesForCurrentMonth = TheCurrentMonth.groupby('Day')['AirTemperature'].mean().reset_index(name='MeanDailyAirTemperature')
    with open('my_csv.csv', 'a') as f:
        df.to_csv(f, header=False)
So in the above code, how do I create my_csv.csv prior to the for loop?
Just a note: I know you can create a data frame and then save it to CSV, but I am interested in whether you can skip this step.
In terms of context I have the following csv files:
Each of which have the following structure:
The Day column reads up to 30 days for each file.
I would like to output a csv file that looks like this:
But obviously includes all the days for all the months.
My issue is that I don't know which months are included in each analysis. Hence I wanted a for loop over a list holding that information, so that it can access the relevant CSVs, calculate the mean temperature, and save it all into one CSV.
Input as text:
Unnamed: 0 AirTemperature AirHumidity SoilTemperature SoilMoisture LightIntensity WindSpeed Year Month Day Hour Minute Second TimeStamp MonthCategorical TimeOfDay
6 6 18 84 17 41 40 4 2016 1 1 6 1 1 10106 January Day
7 7 20 88 22 92 31 0 2016 1 1 7 1 1 10107 January Day
8 8 23 1 22 59 3 0 2016 1 1 8 1 1 10108 January Day
9 9 23 3 22 72 41 4 2016 1 1 9 1 1 10109 January Day
10 10 24 63 23 83 85 0 2016 1 1 10 1 1 10110 January Day
11 11 29 73 27 50 1 4 2016 1 1 11 1 1 10111 January Day
Just open the file in write mode to create it.
with open('my_csv.csv', 'w'):
    pass
That said, I don't think you should open and close the file so many times. It is better to open the file once and write to it several times.
with open('my_csv.csv', 'w') as f:
    for EachMonth in MonthsInAnalysis:
        TheCurrentMonth = pd.read_csv('MonthlyDataSplit/Day/Day%s.csv' % EachMonth)
        MeanDailyTemperaturesForCurrentMonth = TheCurrentMonth.groupby('Day')['AirTemperature'].mean().reset_index(name='MeanDailyAirTemperature')
        # Write the frame computed for this month (the original referenced an undefined df)
        MeanDailyTemperaturesForCurrentMonth.to_csv(f, header=False)
Creating a blank CSV file is as simple as this:
import pandas as pd
pd.DataFrame({}).to_csv("filename.csv")
I would do it this way: first read all your CSV files (but only the columns that you really need) into one DF, then call groupby(['Year','Month','Day']).mean() and save the resulting DF to a CSV file:
import glob
import pandas as pd
fmask = 'MonthlyDataSplit/Day/Day*.csv'
df = pd.concat((pd.read_csv(f, sep=',', usecols=['Year','Month','Day','AirTemperature']) for f in glob.glob(fmask)))
df.groupby(['Year','Month','Day']).mean().to_csv('my_csv.csv')
And if you want to ignore the year:
import glob
import pandas as pd
fmask = 'MonthlyDataSplit/Day/Day*.csv'
df = pd.concat((pd.read_csv(f, sep=',', usecols=['Month','Day','AirTemperature']) for f in glob.glob(fmask)))
df.groupby(['Month','Day']).mean().to_csv('my_csv.csv')
Some details:
(pd.read_csv(f, sep=',', usecols=['Month','Day','AirTemperature']) for f in glob.glob('*.csv'))
is a generator expression that yields one data frame per CSV file
pd.concat(...)
will concatenate them into a single resulting DF
df.groupby(['Year','Month','Day']).mean()
will produce the wanted report as a data frame, which can be saved to a new CSV file:
.to_csv('my_csv.csv')
The problem is a little unclear, but assuming you have to iterate month by month and apply the groupby as stated, just use:
# Before the loop
dflist = []
Then in each iteration do something like:
dflist.append(MeanDailyTemperaturesForCurrentMonth)
Then at the end (pass the list itself, not a list wrapped in another list):
final_df = pd.concat(dflist)
and this will stack everything into one dataframe.
Look at:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html
http://pandas.pydata.org/pandas-docs/stable/merging.html
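The collect-then-concatenate pattern can be sketched end to end as follows. The per-month frames here are invented in memory so the example runs on its own; in the real loop they would come from the groupby on each monthly CSV, and the Month column is an assumption added so the rows stay distinguishable after stacking.

```python
import pandas as pd

# Hypothetical per-month results standing in for the groupby output.
sample_means = {1: [18.0, 20.0], 2: [23.0, 25.0]}

monthly_frames = []
for month, temps in sample_means.items():
    monthly = pd.DataFrame({
        "Month": month,
        "Day": [1, 2],
        "MeanDailyAirTemperature": temps,
    })
    monthly_frames.append(monthly)

# Stack the monthly frames row-wise and renumber the index from 0.
combined = pd.concat(monthly_frames, ignore_index=True)
print(combined)
```

A single combined.to_csv('my_csv.csv', index=False) at the end would then replace the repeated appends to the file.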
You could also do this to create an empty CSV that has column headers but no index column.
import pandas as pd
# The file name must be a string; to_csv returns None when given a path, so nothing is assigned
pd.DataFrame(columns=["Col1","Col2","Col3"]).to_csv("filename.csv", index=False)
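To tie this back to the question, a header-only file can be created first and then appended to inside the loop. This is a minimal sketch with assumed details: the output goes to a temporary directory, the column names mirror the question, and the two chunks of data are invented.

```python
import os
import tempfile

import pandas as pd

# Write to a temporary directory so the sketch leaves no files behind.
out_path = os.path.join(tempfile.mkdtemp(), "my_csv.csv")
columns = ["Day", "MeanDailyAirTemperature"]

# Step 1: create the file with a header row only.
pd.DataFrame(columns=columns).to_csv(out_path, index=False)

# Step 2: append each month's rows without repeating the header.
for temps in ([18.0, 20.0], [23.0, 25.0]):
    chunk = pd.DataFrame({"Day": [1, 2], "MeanDailyAirTemperature": temps})
    chunk.to_csv(out_path, mode="a", header=False, index=False)

result = pd.read_csv(out_path)
print(result)
```

Appending works here because every chunk has the same columns in the same order as the header written in step 1.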
