I am new to Pandas. I am trying to make a data set with ZIP Code, Population in that ZIP Code, and Number of Counties in the ZIP Code.
I get the data from the Census website: https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt
I am trying the following code, but it is not working. Could you help me figure out the correct code? I have a hunch that the error is related to the data frame or the data types, but I cannot work out how to make it right. Please let me know your thoughts. Thank you in advance!
import pandas as pd
df = pd.read_csv("zcta_county_rel_10.txt", dtype={'ZCTA5': str, 'STATE': str, 'COUNTY': str}, usecols=['ZCTA5', 'STATE', 'COUNTY', 'ZPOP'])
zcta_pop = df.drop_duplicates(subset={'ZCTA5', 'ZPOP'}).drop(['STATE', 'COUNTY'], 1)
zcta_ct_county = df['ZCTA5'].value_counts()
zcta_ct_county.columns = ['ZCTA5', 'CT_COUNTY']
pre_merge_1 = pd.merge(zcta_pop, zcta_ct_county, on='ZCTA5')[['ZCTA5', 'ZPOP', 'CT_COUNTY']]
Here is my error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/python27/lib/python2.7/site-packages/pandas/tools/merge.py", line 58, in merge copy=copy, indicator=indicator)
File "/usr/local/python27/lib/python2.7/site-packages/pandas/tools/merge.py", line 473, in __init__ 'type {0}'.format(type(right)))
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
SOLUTION
import pandas as pd
df = pd.read_csv("zcta_county_rel_10.txt", dtype={'ZCTA5': str, 'STATE': str, 'COUNTY': str}, usecols=['ZCTA5', 'STATE', 'COUNTY', 'ZPOP'])
zcta_pop = df.drop_duplicates(subset=['ZCTA5', 'ZPOP']).drop(columns=['STATE', 'COUNTY'])
zcta_ct_county = df['ZCTA5'].value_counts().reset_index()
zcta_ct_county.columns = ['ZCTA5', 'CT_COUNTY']
pre_merge_1 = pd.merge(zcta_pop, zcta_ct_county, on='ZCTA5')[['ZCTA5', 'ZPOP', 'CT_COUNTY']]
You need to add reset_index, because the output of value_counts is a Series, while merge needs a DataFrame with two columns:
zcta_ct_county = df['ZCTA5'].value_counts().reset_index()
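To see why the reset is needed, here is a minimal demonstration with made-up ZCTA5 codes: value_counts() returns a Series whose index holds the codes and whose values hold the counts, and reset_index() turns that into the two-column DataFrame that merge can accept.

```python
import pandas as pd

# Made-up ZCTA5 codes: value_counts() returns a Series whose index holds
# the codes and whose values hold the counts.
df = pd.DataFrame({'ZCTA5': ['00601', '00601', '00602']})
counts = df['ZCTA5'].value_counts()
print(type(counts))  # a Series, which pd.merge cannot take as 'right'

# reset_index() promotes the index to a regular column, giving the
# two-column DataFrame that pd.merge() requires.
zcta_ct_county = counts.reset_index()
zcta_ct_county.columns = ['ZCTA5', 'CT_COUNTY']
print(zcta_ct_county)
```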
Related
With colormath I convert an RGB value to an xyY value. It works fine for one RGB value, but I can't find the right code to do the conversion for multiple RGB values imported from an Excel file. I use the following code:
from colormath.color_objects import sRGBColor, xyYColor
from colormath.color_conversions import convert_color
import pandas as pd
data = pd.read_excel(r'C:/Users/User/Desktop/Color/Fontane/RGB/FontaneHuco.xlsx')
df = pd.DataFrame(data, columns=['R', 'G', 'B'])
#print(df)
rgb = sRGBColor(df['R'],df['G'],df['B'], is_upscaled=True)
xyz = convert_color(rgb, xyYColor)
print(xyz)
But when I run this code I receive the following error:
Traceback (most recent call last):
File "C:\Users\User\PycharmProjects\pythonProject4\Overige\Chroma.py", line 9, in <module>
lab = sRGBColor(df['R'], df['G'], df['B'])
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\colormath\color_objects.py", line 524, in __init__
self.rgb_r = float(rgb_r)
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 141, in wrapper
raise TypeError(f"cannot convert the series to {converter}")
TypeError: cannot convert the series to <class 'float'>
Does anyone have an idea how to fix this problem?
sRGBColor expects individual float values and you're giving it whole DataFrame columns instead. You need to apply the conversion one row at a time, which can be done as follows:
xyz = df.apply(
lambda row: convert_color(
sRGBColor(row.R, row.G, row.B, is_upscaled=True), xyYColor
),
axis=1,
)
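Here is a runnable sketch of the same row-wise pattern. to_xyy is a stand-in for convert_color(sRGBColor(...), xyYColor) so the example works without colormath installed; its arithmetic is a placeholder, not real colorimetry. Returning a pd.Series from the lambda makes apply() spread the components into their own columns.

```python
import pandas as pd

# Stand-in conversion (placeholder math, not real colorimetry) so the
# sketch runs without colormath; swap in convert_color/sRGBColor in practice.
def to_xyy(r, g, b):
    total = float(r + g + b) or 1.0  # avoid division by zero for black
    # returning a pd.Series makes apply() spread the parts into columns
    return pd.Series({'x': r / total, 'y': g / total, 'Y': b / total})

df = pd.DataFrame({'R': [255, 0], 'G': [0, 255], 'B': [0, 0]})
xyy = df.apply(lambda row: to_xyy(row.R, row.G, row.B), axis=1)
print(xyy)
```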
I have data that looks like this. In each column there are key/value pairs of varying lengths. Some rows are also NaN.
like match
0 [{'timestamp', 'type'}] [{'timestamp', 'type'}]
1 [{'timestamp', 'comment', 'type'}] [{'timestamp', 'type'}]
2 NaN NaN
I want to split these lists into their own columns. I want to keep all the data (and make it NaN if it is missing). I've tried following this tutorial and doing this:
df1 = pd.DataFrame(df['like'].values.tolist())
df1.columns = 'like_'+ df1.columns
df2 = pd.DataFrame(df['match'].values.tolist())
df2.columns = 'match_'+ df2.columns
col = df.columns.difference(['like','match'])
df = pd.concat([df[col], df1, df2],axis=1)
I get this error.
Traceback (most recent call last):
File "link to my file", line 12, in <module>
df1 = pd.DataFrame(df['like'].values.tolist())
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 509, in __init__
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 524, in to_arrays
return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 561, in _list_to_arrays
content = list(lib.to_object_array(data).T)
File "pandas/_libs/lib.pyx", line 2448, in pandas._libs.lib.to_object_array
TypeError: object of type 'float' has no len()
Can someone help me understand what I'm doing wrong?
You can't call values.tolist() when some cells are NaN (a float has no len()). If you delete those rows of NaNs, you can get past this issue, but then your prefix line fails, because the new frame's columns are integers rather than strings. See this for prefixes:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add_prefix.html
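One possible sketch, under the assumption (matching your sample) that each non-NaN cell holds a list whose first element is a dict; the keys here are hypothetical. NaN cells are replaced with an empty dict before expanding, so the missing rows come out as NaN columns instead of raising, and add_prefix handles the column names.

```python
import pandas as pd

# Hypothetical data shaped like the question: lists of one dict, plus NaNs.
df = pd.DataFrame({
    'like': [[{'timestamp': 1, 'type': 'a'}], float('nan')],
    'match': [[{'timestamp': 2, 'type': 'b'}], float('nan')],
})

def expand(col, prefix):
    # take the first dict from each list, or {} for NaN cells,
    # so missing rows become NaN instead of raising TypeError
    records = [c[0] if isinstance(c, list) else {} for c in col]
    return pd.DataFrame(records, index=col.index).add_prefix(prefix)

out = pd.concat([expand(df['like'], 'like_'), expand(df['match'], 'match_')], axis=1)
print(out)
```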
I am trying to convert a pandas Series with a certain frequency to a pandas Series with a different frequency. I therefore used the resample function, but it does not recognize, for instance, that 'M' is a subperiod of '3M', and raises an error.
import pandas as pd
idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)
data_higher_freq = data_1.resample('3M', kind="Period").sum()
Raises the following exception:
Traceback (most recent call last):
  File "/home/mitch/Programs/Infrastructure_software/Sandbox/spyderTest.py", line 15, in <module>
    data_higher_freq = data_1.resample('3M', kind="Period").sum()
  File "/home/mitch/anaconda3/lib/python3.6/site-packages/pandas/core/resample.py", line 758, in f
    return self._downsample(_method, min_count=min_count)
  File "/home/mitch/anaconda3/lib/python3.6/site-packages/pandas/core/resample.py", line 1061, in _downsample
    'sub or super periods'.format(ax.freq, self.freq))
pandas._libs.tslibs.period.IncompatibleFrequency: Frequency <MonthEnd> cannot be resampled to <3 * MonthEnds>, as they are not sub or super periods
This seems to be due to the pd.tseries.frequencies.is_subperiod function:
import pandas as pd
pd.tseries.frequencies.is_subperiod('M', '3M')
pd.tseries.frequencies.is_subperiod('M', 'Q')
Indeed it returns False for the first command and True for the second.
I would really appreciate any hints about a solution.
Thanks.
Try changing from PeriodIndex to DateTimeIndex before resampling:
import pandas as pd
idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)
data_1.index = data_1.index.astype('datetime64[ns]')
data_higher_freq = data_1.resample('3M', kind='period').sum()
Output:
data_higher_freq
Out[582]:
2017-01 3
2017-04 12
Freq: 3M, dtype: int64
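An alternative sketch that avoids resample() entirely, under the assumption that 3-month bins starting in January are what you want (calendar quarters coincide with those bins): asfreq('Q') maps each monthly period to its containing quarter, and a groupby sums within each quarter.

```python
import pandas as pd

# Same data as the question.
idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)

# asfreq('Q') converts each monthly period to the quarter containing it;
# grouping on that and summing gives the downsampled series.
out = data_1.groupby(data_1.index.asfreq('Q')).sum()
print(out)
```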
I am new to Python and I am trying to build a time series from this CSV data. Based on internet and Stack Overflow research, the result should have
<class 'pandas.tseries.index.DatetimeIndex'>,
but my output is not a converted time series. Why is it not converting? How do I convert it? Thanks for the help in advance.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
data = pd.read_csv('somedata.csv')
print data.head()
#selecting specific columns by column name
df1 = data[['a','b']]
#converting the data to time series
dates = pd.date_range('2015-01-01', '2015-12-31', freq='H')
dates #preview
results:
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
...
'2015-12-31 23:00:00', '2015-12-31 00:00:00'],
dtype='datetime64[ns]', length=2161, freq='H')
Above is working, however I get error below:
df1 = Series(df1[:,2], index=dates)
output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'Series' is not defined
After attempting the pd.Series...
df1 = pd.Series(df1[:,2], index=dates)
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/someid/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1992, in __getitem__
    return self._getitem_column(key)
  File "/home/someid/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1999, in _getitem_column
    return self._get_item_cache(key)
  File "/home/someid/miniconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1343, in _get_item_cache
    res = cache.get(item)
TypeError: unhashable type
You do need pd.Series. However, you were also doing something else wrong: df1[:, 2] is not valid DataFrame indexing. I'm assuming you want all rows of the 2nd column of df1, returned as a pd.Series with an index of dates.
Solution
df1 = pd.Series(df1.iloc[:, 1].values, index=dates)
Explanation
df1.iloc returns a slice of df1 by row/column position.
[:, 1] selects all rows of the 2nd column.
Also, df1.iloc[:, 1] returns a pd.Series; note that passing a Series that carries its own index into the pd.Series constructor along with a new index reindexes by label (which can produce all NaNs), so passing the underlying .values array is safer.
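A tiny illustration with a made-up frame, showing that iloc selects by position, so [:, 1] means every row of the second column, returned as a Series:

```python
import pandas as pd

# Made-up frame: iloc is purely positional, unlike label-based indexing.
df1 = pd.DataFrame({'a': [10, 20, 30], 'b': [1.5, 2.5, 3.5]})
col = df1.iloc[:, 1]  # all rows, 2nd column
print(type(col))
print(col.tolist())
```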
I'm importing a CSV into a pandas data frame, and then I'm trying to create three new columns in that data frame from data retrieved from geopy.geocoders.GoogleV3() :
import pandas
from geopy.geocoders import GoogleV3
DATA = pandas.read_csv("file/to/csv")
geolocator = GoogleV3()
DATA.googleaddress, (DATA.latitude, DATA.longitude) = geolocator.geocode(DATA.address)
Problem is I keep getting this error:
Traceback (most recent call last):
File "C:/Path/To/GeoCoder.py", line 9, in <module>
DATA.googleaddress, (DATA.latitude, DATA.longitude) = geolocator.geocode(DATA.address)
TypeError: 'NoneType' object is not iterable
What does this error mean and how do I get around it?
Because geolocator.geocode expects a single address at a time, not a whole list (or Series). When it cannot geocode its input it returns None, and unpacking that None raises the TypeError.
You could try:
locs = [ geolocator.geocode(addr) for addr in DATA.address ]
geo_info = pandas.DataFrame(
[ (addr.address, addr.latitude, addr.longitude) for addr in locs ],
columns=['googleaddress', 'latitude', 'longitude'])
All you would have to do is merge these DataFrames:
DATA.combine_first(geo_info)
Note that it is considered bad form to have an all-uppercase variable name in Python.
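One caveat with the list comprehension above: geocode also returns None for individual addresses it cannot resolve, which would crash the attribute lookups. A guarded sketch, using a stand-in fake_geocode in place of geolocator.geocode (and made-up address data) so it runs without network access:

```python
import pandas as pd

# fake_geocode stands in for geolocator.geocode; like geopy, it returns
# None for addresses it cannot resolve. The data here is made up.
def fake_geocode(addr):
    known = {'1600 Amphitheatre Pkwy': ('Mountain View, CA', 37.42, -122.08)}
    return known.get(addr)

DATA = pd.DataFrame({'address': ['1600 Amphitheatre Pkwy', 'nowhere']})
locs = [fake_geocode(a) for a in DATA['address']]
# substitute a row of Nones for failed lookups instead of crashing
geo_info = pd.DataFrame(
    [loc if loc is not None else (None, None, None) for loc in locs],
    columns=['googleaddress', 'latitude', 'longitude'])
result = pd.concat([DATA, geo_info], axis=1)
print(result)
```

With real geopy Location objects you would build the tuple as (loc.address, loc.latitude, loc.longitude) inside the same None guard.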