I have FIPS codes here: http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt
And a dataset that looks like this:
fips_state fips_county value
1 1 10
1 3 34
1 5 37
1 7 88
1 9 93
How can I get the county name of each row using the data from the link above with pandas?
Simply load both data sets into DataFrames, then set the appropriate index:
df1.set_index(['fips_state', 'fips_county'], inplace=True)
This gives you a MultiIndex by state+county. Once you've done this for both datasets, you can trivially map them, for example:
df1['county_name'] = df2.county_name
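A fuller sketch of the whole flow. The column layout of national_county.txt is an assumption here (the file has no header row, and the columns are taken to be state abbreviation, state FIPS, county FIPS, county name, class code):

import pandas as pd

# assumed column layout for national_county.txt (no header row in the file)
cols = ['state', 'fips_state', 'fips_county', 'county_name', 'class_code']
url = 'http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt'
df2 = pd.read_csv(url, header=None, names=cols)

# df1 is your dataset with fips_state / fips_county / value
df1.set_index(['fips_state', 'fips_county'], inplace=True)
df2.set_index(['fips_state', 'fips_county'], inplace=True)

# rows align on the shared MultiIndex, so each row picks up its county name
df1['county_name'] = df2['county_name']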
I have a pandas DataFrame called data_combined with the following structure:
index corr_year corr_5d
0 (DAL, AAL) 0.873762 0.778594
1 (WEC, ED) 0.851578 0.850549
2 (CMS, LNT) 0.850028 0.776143
3 (SWKS, QRVO) 0.850799 0.830603
4 (ALK, DAL) 0.874162 0.744590
Now I am trying to split the column named index into two columns at the comma.
The desired output should look like this:
index1 index2 corr_year corr_5d
0 DAL AAL 0.873762 0.778594
1 WEC ED 0.851578 0.850549
2 CMS LNT 0.850028 0.776143
3 SWKS QRVO 0.850799 0.830603
4 ALK DAL 0.874162 0.744590
I have tried using pd.explode() with the following code:
data_results_test = data_results_combined.explode('index')
data_results_test
Which leads to the following output:
index corr_year corr_5d
0 DAL 0.873762 0.778594
0 AAL 0.873762 0.778594
1 WEC 0.851578 0.850549
1 ED 0.851578 0.850549
How can I achieve the split with newly added columns instead of rows? pd.explode does not seem to have any option to choose whether to add new rows or columns.
How about a simple apply? (Assuming the 'index' column contains tuples.)
data_results_combined['index1'] = data_results_combined['index'].apply(lambda x: x[0])
data_results_combined['index2'] = data_results_combined['index'].apply(lambda x: x[1])
Alternatively, if the 'index' column holds comma-separated strings rather than tuples, str.split with expand=True produces both columns at once:
df[['index1', 'index2']] = df['index'].str.split(',', expand=True)
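If the values really are tuples, another option (a small sketch, not part of the answers above) is to unpack them into a two-column frame in one step:

import pandas as pd

# build both columns at once from the list of (left, right) tuples
data_results_combined[['index1', 'index2']] = pd.DataFrame(
    data_results_combined['index'].tolist(),
    index=data_results_combined.index
)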
This question already has an answer here: Pandas : zscore among the groups (1 answer). Closed 5 months ago.
Below is an example of the df I use (sales data). The df is big: several GB of data, a few thousand brands, data for the past 12 months, and hundreds of territories.
index date brand territory value
0 2019-01-01 A 1 63
1 2019-02-01 A 1 91
2 2019-03-01 A 1 139
3 2019-04-01 A 1 80
4 2019-05-01 A 1 149
I want to find outliers for each individual brand across all territories for all dates.
To find outliers within the whole df I can use
outliers = df[(np.abs(stats.zscore(df['value'])) > 3)]
or stats.zscore(df['value']) just to calculate the z-score.
I would like to add a column df['z-score'],
so I thought about something like this, but apparently it doesn't work:
df['z-score'] = df.groupby('brand', as_index=False)['value'].stats.zscore(df['value'])
Use transform:
df['z-score'] = df.groupby('brand')['value'].transform(stats.zscore)
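A fuller sketch of the per-brand version (assuming value is numeric and scipy is installed):

import numpy as np
from scipy import stats

# z-score of 'value' computed within each brand group
df['z-score'] = df.groupby('brand')['value'].transform(stats.zscore)

# rows more than 3 standard deviations from their own brand's mean
outliers = df[np.abs(df['z-score']) > 3]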
This question already has answers here: Convert columns into rows with Pandas (6 answers). Closed 2 years ago.
I have a large table with multiple columns as the input table, in the format below:
Col-A Col-B Col-C Col-D Col-E Col-F
001 10 01/01/2020 123456 123123 123321
001 20 01/02/2020 123456 123111
002 10 01/03/2020 111000 111123
And I'd like to write code so that there is one line per value for each Col-A, and instead of the multiple columns Col-D, Col-E, Col-F I will only have Col-D:
Col-A Col-B Col-C Col-D
001 10 01/01/2020 123456
001 10 01/01/2020 123123
001 10 01/01/2020 123321
001 20 01/02/2020 123456
001 20 01/02/2020 123111
002 10 01/03/2020 111000
002 10 01/03/2020 111123
Any ideas will be appreciated,
Thanks,
Nurbek
You can use pd.melt:
import pandas as pd
newdf = pd.melt(
    df,
    id_vars=['Col-A', 'Col-B', 'Col-C'],
    value_vars=['Col-D', 'Col-E', 'Col-F']
).dropna()
This will drop 'Col-D', 'Col-E' and 'Col-F', but create two new columns, variable and value. The variable column denotes which column each value came from. To achieve what you want, drop the variable column and rename the value column to Col-D.
newdf = newdf.drop(['variable'], axis=1)
newdf = newdf.rename(columns={"value":"Col-D"})
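If the row order matters (the desired output keeps each Col-A/Col-B group together), a sort after the melt restores it; this line is a small addition to the answer above:

newdf = newdf.sort_values(['Col-A', 'Col-B']).reset_index(drop=True)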
What about something like this:
df2 = df[["Col-A", "Col-B", "Col-C", "Col-D"]]
columns = ["Col-E", "Col-F", ..., "Col-Z"]
for col in columns:
    # rename the value column so it stacks under Col-D, and keep the result
    df2 = df2.append(
        df[["Col-A", "Col-B", "Col-C", col]].rename(columns={col: "Col-D"})
    ).reset_index(drop=True)
You just append each remaining value column onto the base frame.
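Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same idea would be written with pd.concat; a rough equivalent (the column list past Col-F is elided, as above):

import pandas as pd

frames = [df[["Col-A", "Col-B", "Col-C", "Col-D"]]]
for col in ["Col-E", "Col-F"]:  # extend with any further value columns
    frames.append(df[["Col-A", "Col-B", "Col-C", col]].rename(columns={col: "Col-D"}))
df2 = pd.concat(frames, ignore_index=True)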
I have a frame moviegoers that includes zip codes but not cities.
I then redefined moviegoers as zipcodes and changed the zip codes from a Series to a DataFrame.
zipcodes = pd.read_csv('NYC1-moviegoers.csv',dtype={'zip_code': object})
I know the dataset URL I need is this: https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv.
I defined a DataFrame, zip_codes, to pull the data from that dataset and changed it from a Series to a DataFrame so it's in the same format as the zipcodes DataFrame.
I want to merge the dataframes so I can have the movie goer data. But, instead of zipcodes, I want to have the state abbreviation. This is where I am having issues.
The end goal is to count the number of movie goers per state. Example ideal output:
CA 116
MN 78
NY 60
TX 51
IL 50
Any ideas would be greatly appreciated.
I think you need map with a Series and then value_counts for the count:
print (zipcodes)
zip_code
0 85711
1 94043
2 32067
3 43537
4 15213
s = zip_codes.set_index('Zipcode')['State']
df = zipcodes['zip_code'].map(s).value_counts().rename_axis('state').reset_index(name='count')
print (df.head())
state count
0 OH 1
1 CA 1
2 FL 1
3 AZ 1
4 PA 1
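One thing to watch (an assumption about how the reference file is read, not stated in the answer above): map only matches when both zip columns share the same dtype, so read the reference file's Zipcode column as strings as well, for example:

zip_codes = pd.read_csv(
    'https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv',
    dtype={'Zipcode': object}
)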
Simply merge both datasets on the Zipcode column, then run a groupby for the state counts.
# READ DATA FILES WITH RENAMING OF ZIP COLUMN IN FIRST
url = "https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv"
moviegoers = pd.read_csv('NYC1-moviegoers.csv', dtype={'zip_code': object}).rename(columns={'zip_code': 'Zipcode'})
zipcodes = pd.read_csv(url, dtype={'Zipcode': object})
# MERGE ON COMMON FIELD
merged_df = pd.merge(moviegoers, zipcodes, on='Zipcode')
# AGGREGATE BY INDICATOR (STATE)
merged_df.groupby('State').size()
# ALTERNATIVE GROUP BY COUNT
merged_df.groupby('State')['Zipcode'].agg('count')
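If the counts should be ordered largest first, as in the desired output, a sort can be chained on (a small addition to the answer above):

merged_df.groupby('State').size().sort_values(ascending=False)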
I originally have 3 columns: timestamp, response_time and type. What I need to do is find the mean of response_time where the timestamps are the same, so I grouped by timestamp and applied the mean function. I got the following Series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
And I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing a groupby('timestamp') and then mean() on that.
The problem I need to solve is how to extract each column of this Series. Or is there a better way to calculate the mean of one column for all equal entries of another column?
ORIGINAL DATA:
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average response_time and plot it. I did that successfully, as shown in the Series above (the first data), but I cannot separate the timestamp and response_time columns anymore.
A Series always has just one column. The first column you see is the index. You can get it with your_series.index. If you want the timestamp to become a data column again, and not an index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index=False).mean()
Or use your_series.reset_index().
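A small end-to-end sketch of the grouping and the plot; the file name and column names here are assumptions, since the raw data shown has no header row:

import pandas as pd

# raw file has no header row: timestamp, type, response_time
df = pd.read_csv('data.csv', header=None,
                 names=['timestamp', 'type', 'response_time'])

# mean response_time per timestamp, with timestamp kept as a regular column
means = df.groupby('timestamp', as_index=False)['response_time'].mean()

# plot timestamp vs mean response_time
means.plot(x='timestamp', y='response_time')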
If it's a Series, you can directly use:
your_series.mean()
You can extract a column with:
df['column_name']
and then apply mean() to the resulting Series:
df['column_name'].mean()