I am new to Python and I would really appreciate the assistance; I have been trying the entire day. I have a CSV file containing 10 columns, but I am only interested in 3: state, county, and zip code. I am trying to get a count of the occurrences in each column, for instance CA 20000, TX 14000, and to have the count results saved in a CSV file that can easily be imported into Excel and further merged with geospatial files.
I managed to select the 3 columns that I need:
import numpy as np
from tabulate import tabulate
import pandas as pd
# Replace with the path and file name on your computer
filename = "10column.csv"
# Enter the zero-based column numbers you want to keep, with no spaces between the commas: usecols=[4,5,6]
table = np.genfromtxt(filename, delimiter=',', skip_header=0, dtype='U', usecols=[4,5,6])
print(tabulate(table))
# Insert the path and name of the output file
pd.DataFrame(table).to_csv("3column.csv")
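As an aside, the intermediate numpy step can be skipped entirely; pandas can select just those three columns directly. A minimal sketch, assuming the file has no header row (as skip_header=0 suggests) and using the hypothetical names state/county/zip:
import pandas as pd

# keep only the three columns of interest (positions taken from the snippet above)
df = pd.read_csv("10column.csv", usecols=[4, 5, 6], header=None,
                 names=["state", "county", "zip"])
df.to_csv("3column.csv", index=False)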
Then I tried to count the occurrences, but the output is in the wrong format and I cannot save it as a CSV.
import csv
from collections import Counter
import numpy as np
my_reader = csv.reader(open("3column.csv"))
# insert the column number you want instead of the 2 in rec[2]
column = [rec[2] for rec in my_reader]
print(np.array([Counter(column)]))
The result is
[Counter({'22209': 10, '20007': 5, …})]
I cannot save it as a CSV, and I would like to have it in a tabulated format:
zip, count
22209, 10
20007, 5
I would really appreciate your help
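For what it's worth, the Counter from the attempt above can already be written in exactly that zip, count layout with the csv module. A minimal sketch (the output file name zip_count.csv is just an example):
import csv
from collections import Counter

with open("3column.csv") as f:
    column = [rec[2] for rec in csv.reader(f)]

counts = Counter(column)
with open("zip_count.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["zip", "count"])
    writer.writerows(counts.items())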
A different approach would be to use value_counts() from pandas. From the documentation:
Return a Series containing counts of unique values.
Example data file 7column.csv:
id,state,city,zip,ip_address,latitude,longitude
1,NY,New York City,10005,246.78.179.157,40.6964,-74.0253
2,WA,Yakima,98907,79.61.127.155,46.6288,-120.574
3,OK,Oklahoma City,73109,129.226.225.133,35.4259,-97.5261
4,FL,Orlando,32859,104.196.5.159,28.4429,-81.4026
5,NY,New York City,10004,246.78.180.157,40.6964,-74.0253
6,FL,Orlando,32860,104.196.5.159,29.4429,-81.4026
7,IL,Chicago,60641,19.226.187.13,41.9453,-87.7474
8,NC,Fayetteville,28314,45.109.1.38,35.0583,-79.008
9,IL,Chicago,60642,19.226.187.14,41.9453,-87.7474
10,WA,Yakima,98907,79.61.127.156,46.6288,-120.574
11,IL,Chicago,60643,19.226.187.15,41.9453,-87.7474
12,CA,Sacramento,94237,77.208.31.167,38.3774,-121.4444
import pandas as pd
df = pd.read_csv("7column.csv")
zipcode = df["zip"].value_counts()
state = df["state"].value_counts()
city = df["city"].value_counts()
zipcode.to_csv('zipcode_count.csv')
state.to_csv('state_count.csv')
city.to_csv('city_count.csv')
CSV output files
state_count.csv | city_count.csv | zipcode_count.csv
,state | ,city | ,zip
IL,3 | Chicago,3 | 98907,2
NY,2 | Orlando,2 | 32859,1
FL,2 | New York City,2 | 94237,1
WA,2 | Yakima,2 | 32860,1
NC,1 | Sacramento,1 | 28314,1
OK,1 | Fayetteville,1 | 10005,1
CA,1 | Oklahoma City,1 | 10004,1
| | 60643,1
| | 60642,1
| | 60641,1
| | 73109,1
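If you would rather have explicit headers such as zip,count in those files instead of the blank index label, a rename_axis/reset_index chain on the same example data should do it:
import pandas as pd

df = pd.read_csv("7column.csv")
# value_counts gives a Series; turn it into a two-column frame with named headers
zip_counts = df["zip"].value_counts().rename_axis("zip").reset_index(name="count")
zip_counts.to_csv("zipcode_count.csv", index=False)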
You could read the file you wrote out to CSV back in as a DataFrame and count with pandas. Note that DataFrame.count only tallies non-NA cells per row or column, so for per-value occurrence counts you want value_counts on the column of interest; the result can then be written to Excel (writing .xlsx requires an engine such as openpyxl):
states_3 = pd.DataFrame(table)
state_count = states_3[0].value_counts()  # column 0 holds the state
out_name = 'statecount.xlsx'
with pd.ExcelWriter(out_name) as writer:
    state_count.to_excel(writer, sheet_name='counts')
I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable by calling createCSV(); it pulls data from the government COVID GitHub):
import pandas as pd  # CSV parsing
import requests  # retrieves the CSV from the government data repo

def getFile():
    url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
           'csse_covid_19_data/csse_covid_19_time_series/'
           'time_series_covid19_deaths_US.csv')
    response = requests.get(url)
    print('Writing file...')
    with open('us_deaths.csv', 'wb') as f:
        f.write(response.content)

# takes raw data from the link, creates a CSV for each unique state,
# and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    #data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    #data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets Province_State as primary key, then searches by date and key
    # to create new CSVs in the root directory of the Python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to dates, so that I can select only 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To check that it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV, as you can see, Admin2 seems to have disappeared. I am not sure how to make this work; if anyone has any ideas, that'd be great!
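One possible alternative for the date selection, as a sketch (untested against the live data, and assuming the date headers parse cleanly): convert the column labels once and slice with a boolean mask instead of renaming inside the loop.
import pandas as pd

# inside createCSV(), after the drops: non-date labels such as
# Province_State and Admin2 become NaT and are kept by the mask
cols_as_dates = pd.to_datetime(data.columns, errors='coerce')
keep = cols_as_dates.isna() | (cols_as_dates >= '2020-03-23')
data = data.loc[:, keep]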
Changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-level key to allow the Admin2 column to show. Any smoother tips on the date-range section are welcome.
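For anyone following along, the relevant part of createCSV() then looks roughly like this (a sketch; it assumes the one remaining non-date column, e.g. Population in the deaths file, is skipped before the rename):
data = data.set_index(['Province_State', 'Admin2'])
# skip the leftover non-date column, then parse the remaining headers as dates
data = data.iloc[:, 1:].rename(columns=pd.to_datetime)
for name, g in data.groupby(level='Province_State'):
    g[pd.date_range('03/23/2020', '03/29/2020')] \
        .to_csv('{0}_confirmed_deaths.csv'.format(name))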
Thanks for the help all!
I have two columns in a DataFrame: one named previous_code and the other New_code. These columns have values such as "PO", "GO", "RO", etc. The codes have a priority; for example, "PO" has a higher priority than "GO". I want to compare the values of these two columns and put the output in a new column as "High", "Low", or "No Change" in case both columns have the same code. Below is an example of what the DataFrame looks like:
CustID | previous_code | New_code
345    | PO            | GO
367    | RO            | PO
385    | PO            | RO
455    | GO            | GO
Expected output DataFrame:
CustID | previous_code | New_code | Change
345    | PO            | GO       | Low
367    | RO            | PO       | High
385    | PO            | RO       | Low
455    | GO            | GO       | No Change
If someone could write demo code for this in PySpark or pandas, that would be helpful.
Thanks in advance.
If I understood the ordering correctly, this should work fine:
import pandas as pd
import numpy as np
data = {'CustID': [345, 367, 385, 455],
        'previous_code': ['PO', 'RO', 'PO', 'GO'],
        'New_code': ['GO', 'PO', 'RO', 'GO']}
df = pd.DataFrame(data)
# lower number = higher priority
mapping = {'PO': 1, 'GO': 2, 'RO': 3}
df['previous_aux'] = df['previous_code'].map(mapping)
df['new_aux'] = df['New_code'].map(mapping)
# same rank -> 'No change'; new code has higher priority (smaller number) -> 'High'; else 'Low'
df['output'] = np.where(df['previous_aux'] == df['new_aux'], 'No change',
                        np.where(df['previous_aux'] > df['new_aux'], 'High', 'Low'))
df = df[['CustID','previous_code','New_code','output']]
print(df)
Output:
CustID previous_code New_code output
0 345 PO GO Low
1 367 RO PO High
2 385 PO RO Low
3 455 GO GO No change
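Since the question also mentions PySpark, here is a rough equivalent using when/otherwise with the same hypothetical priority ranking (PO highest, then GO, then RO):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(345, 'PO', 'GO'), (367, 'RO', 'PO'), (385, 'PO', 'RO'), (455, 'GO', 'GO')],
    ['CustID', 'previous_code', 'New_code'])

def rank(code_col):
    # map each code to its priority rank (lower number = higher priority)
    return F.when(code_col == 'PO', 1).when(code_col == 'GO', 2).otherwise(3)

sdf = sdf.withColumn(
    'Change',
    F.when(F.col('previous_code') == F.col('New_code'), 'No Change')
     .when(rank(F.col('previous_code')) > rank(F.col('New_code')), 'High')
     .otherwise('Low'))
sdf.show()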
I'm trying to import an Excel sheet of data into Python using pandas, but I get a pandas parser error where the expected fields are 10 but it saw 11.
When I specify the columns, it prints all the data plus the column headings, but duplicates the column headings as a row of data.
import pandas as pd
columns=['bookID','title','authors','average_rating','isbn','isbn13','language_code','# num_pages','ratings_count','text_reviews_count']
df = pd.read_csv(r'path of the csv file', names=columns)
print(df)
This shows the column headings:
bookID | title | authors | average_rating | isbn |isbn13 | language_code | # num_pages | ratings_count | text_reviews_count
and then adds the column headings again as the first row of data:
0 | bookID | title | authors | average_rating | isbn |isbn13 | language_code | # num_pages | ratings_count | text_reviews_count
I suppose you can simply use header='infer' (the default), as you already have the column headers in the CSV file you are reading from:
df = pd.read_csv (r'path of the csv file',header='infer')
pandas.read_csv
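If you do want to force your own header names instead, the usual pattern is names= (note the plural) together with skiprows=1, so the header row already in the file is not re-read as data:
import pandas as pd

columns = ['bookID', 'title', 'authors', 'average_rating', 'isbn', 'isbn13',
           'language_code', '# num_pages', 'ratings_count', 'text_reviews_count']
# skiprows=1 discards the header line that the file already contains
df = pd.read_csv(r'path of the csv file', names=columns, skiprows=1)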
I'm trying to concatenate two DataFrames and write the result to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what I'm doing wrong. I thought providing the index=False argument at every Excel call would eliminate the issue, but it has not.
[screenshot of the Excel output showing the unwanted index row]
Hopefully you can see the image; if not, please let me know.
import pandas as pd

# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image either, because in that case you would have output like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
If you pass the index=False option, the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could instead be related to the concatenation and the transposed matrix.
Did you check your temporary DataFrame before exporting it?
You might want to check whether pandas imports the time column as a time index.
If you want to delete those time columns, you could use df.drop and pass the columns into it, e.g. df.drop(columns=df.columns[:3]). Does this maybe solve your problem?
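A small illustration of that drop idea on a toy frame (hypothetical column names):
import pandas as pd

df = pd.DataFrame({'t1': [1], 't2': [2], 't3': [3], 'Column1': ['a'], 'Column2': ['b']})
# drop the first three columns by position
df = df.drop(columns=df.columns[:3])
print(df.columns.tolist())  # ['Column1', 'Column2']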