I am brand new to the community, so I hope you will bear with me a little.
I am trying to merge two datasets by using an inner join on the fields "Postal Code" and "Date".
The original code looks like this:
Datapump = pd.merge(hack, health, how='inner', left_on=['Date', 'CP'], right_on=['Creation', 'cp'])
But the problem is that I get an empty dataset whenever I try to perform a head(), and worse, an error when I try to perform a sample().
So I set the field 'Date' as the index of hack and the field 'Creation' as the index of health, and then go for the join:
Datapump = pd.merge(hack, health, how='inner', left_index=True, right_index=True)
Unfortunately I also need the postal code field, so I do another join at this point:
Datapump = pd.merge(hack, health, how='inner', left_on=['CP'], right_on=['cp'])
Now I can get the sample and the head, but something seems wrong to me, especially when I see the number of entries in the new dataset:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 803206 entries, 0 to 803205
Data columns (total 15 columns):
CP 803206 non-null object
Tipo Contaminante 803206 non-null int64
Valor 803206 non-null float64
Verified 803206 non-null object
nombre 803206 non-null object
edad 801296 non-null object
cp 803206 non-null object
patologia 802387 non-null object
created 803206 non-null datetime64[ns]
Edad_Cat 786829 non-null category
Duration 772661 non-null timedelta64[ns]
Duration_Seconds 772661 non-null float64
weekdays_created 803206 non-null int64
month 803206 non-null float64
cat_month 803206 non-null int64
dtypes: category(1), datetime64[ns](1), float64(3), int64(3), object(6), timedelta64[ns](1)
memory usage: 92.7+ MB
Before the join, health had roughly 9000 entries and hack roughly 6000 entries.
It cannot be right that I get a dataset of 803,206 entries from an inner join.
How can I do this inner join in a way that provides a meaningful and reasonable result?
Thanks a lot for your patience.
Andrea
Eventually I was able to solve the issue. It was due to a problem inside the data frame: I opened the original csv file, manually cleaned the problematic rows, re-imported the new file, and was then able to carry out the join.
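As a side note, a row explosion like this usually means the merge keys are duplicated on both sides, so matching rows multiply into a Cartesian product. A rough sketch of how one might check this before merging, using the column names from the question (the validate argument is a standard pd.merge option, not something from the original post):

import pandas as pd

# Count duplicated key combinations on each side; duplicates on both
# sides multiply into a Cartesian product in an inner join.
print(hack.duplicated(subset=['Date', 'CP']).sum())
print(health.duplicated(subset=['Creation', 'cp']).sum())

# Optionally ask pandas to raise a MergeError if the keys are not unique.
Datapump = pd.merge(hack, health, how='inner',
                    left_on=['Date', 'CP'], right_on=['Creation', 'cp'],
                    validate='one_to_one')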
Regards,
Andrea
I am reading a csv file as a dataframe in Python. Then I use the two commands below to get more information about it.
Is there a way to copy the output of these two commands into separate data frames?
data.describe(include='all')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null object
2 color 53940 non-null object
3 clarity 53940 non-null object
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Regarding df.describe(): as its output is a DataFrame itself, you can either create a new dataframe directly or save it to csv as below:
des=pd.DataFrame(df.describe())
or
df.describe().to_csv()
Regarding df.info(): its return value is of type 'NoneType', which means it cannot be saved directly. You can check some alternative solutions here:
Is there a way to export pandas dataframe info -- df.info() into an excel file?
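One such alternative, sketched here under the assumption that data is the DataFrame from the question: df.info() accepts a buf argument, so its output can be captured in a text buffer and saved as plain text:

import io

buffer = io.StringIO()
data.info(buf=buffer)           # info() writes into the buffer instead of stdout
info_text = buffer.getvalue()   # plain string with the info() report

with open('data_info.txt', 'w') as f:
    f.write(info_text)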
I suppose this should be an easy question for experienced users. I want to group records by day of the week and get the number of records for each weekday.
Here is my DataFrame rent_week.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1689 entries, 3 to 1832
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1689 non-null int64
1 createdAt 1689 non-null datetime64[ns]
2 updatedAt 1689 non-null datetime64[ns]
3 endAt 1689 non-null datetime64[ns]
4 timeoutAt 1689 non-null datetime64[ns]
5 powerBankId 1689 non-null int64
6 station_id 1689 non-null int64
7 endPlaceId 1686 non-null float64
8 endStatus 1689 non-null object
9 userId 1689 non-null int64
10 station_name 1689 non-null object
dtypes: datetime64[ns](4), float64(1), int64(4), object(2)
memory usage: 158.3+ KB
Data in the 'createdAt' column looks like "2020-07-19T18:00:27.190010000".
I am trying to add new column:
rent_week['a_day'] = rent_week['createdAt'].strftime('%A')
and receive error back: AttributeError: 'Series' object has no attribute 'strftime'.
Meanwhile, if I write:
a_day = datetime.today()
print(a_day.strftime('%A'))
it shows the expected result. In my understanding, a_day and rent_week['a_day'] have a similar datetime type.
Even going through:
rent_week['a_day'] = pd.to_datetime(rent_week['createdAt']).strftime('%A')
shows me the same error: no strftime attribute.
I haven't even started grouping my data. What I am expecting as a result is a DataFrame with information like:
a_day number_of_records
Monday 101
Tuesday 55
...
Try a_day.dt.strftime('%A') - note the additional .dt on your DataFrame column/Series object.
Background: the "similar" type assumption you make is almost correct. However, as a column could be of many types (numeric, string, datetime, geographic, ...), the methods of the underlying values are typically stored in a namespace to not clutter the already broad API (method count) of the Series type itself. That's why string functions are available only through .str, and datetime functions only available through .dt.
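A tiny illustration of the two namespaces, with toy data not taken from the question:

import pandas as pd

dates = pd.Series(pd.to_datetime(['2020-07-19', '2020-07-20']))
words = pd.Series(['monday', 'tuesday'])

print(dates.dt.strftime('%A'))  # datetime methods live under .dt
print(words.str.upper())        # string methods live under .str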
You can make a lambda function for the conversion and apply it to the "createdAt" column. After this step you can groupby based on your requirement (see the sketch after the code). You can take help from this code:
rent_week['a_day'] = rent_week['createdAt'].apply(lambda x: x.strftime('%A'))
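For the grouping step itself, one possible sketch (untested against the original data, column names as in the expected output above):

counts = (rent_week.groupby('a_day')
                   .size()
                   .reset_index(name='number_of_records'))
print(counts)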
Thank you Quamar and Ojdo for your contributions. I found the problem: it is in the index.
<ipython-input-41-a42a82727cdd>:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
As soon as I reset the index,
rent_week.reset_index()
both variants are working as expected!
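One small note added here (not part of the original answer): reset_index returns a new DataFrame by default, so remember to assign it back for the fix to take effect:

rent_week = rent_week.reset_index()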
In Python 3 and pandas I have two dataframes with the same structure:
df_posts_final_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32669 entries, 0 to 32668
Data columns (total 12 columns):
post_id 32479 non-null object
text 31632 non-null object
post_text 30826 non-null object
shared_text 3894 non-null object
time 32616 non-null object
image 24585 non-null object
likes 32669 non-null object
comments 32669 non-null object
shares 32669 non-null object
post_url 26157 non-null object
link 4343 non-null object
cpf 32669 non-null object
dtypes: object(12)
memory usage: 3.0+ MB
df_posts_final_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33883 entries, 0 to 33882
Data columns (total 12 columns):
post_id 33698 non-null object
text 32755 non-null object
post_text 31901 non-null object
shared_text 3986 non-null object
time 33829 non-null object
image 25570 non-null object
likes 33883 non-null object
comments 33883 non-null object
shares 33883 non-null object
post_url 27286 non-null object
link 4446 non-null object
cpf 33883 non-null object
dtypes: object(12)
memory usage: 3.1+ MB
I want to combine them, and I could just do it like this:
frames = [df_posts_final_1, df_posts_final_2]
result = pd.concat(frames)
But the "post_id" column has unique identification codes. So when there is an id "X" in df_posts_final_1 it doesn't need to appear two times in the final dataframe result
Example, if the code "FLK1989" appears in df_posts_final_1 and also in df_posts_final_2, I leave only the last record that was in df_posts_final_2
Please, does anyone know the correct strategy to do this?
Fix your code by adding groupby + tail:
frames = [df_posts_final_1, df_posts_final_2]
result = pd.concat(frames).groupby('post_id').tail(1)
Or use drop_duplicates:
frames = [df_posts_final_2, df_posts_final_1]  # order here is important
result = pd.concat(frames).drop_duplicates('post_id')
Try to use:
result = pd.concat(frames).drop_duplicates(subset='post_id', keep='last')
The keep='last' parameter will keep only the second one, as you want.
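A small self-contained illustration of the keep='last' behaviour, using toy data and the example id from the question (column names invented for the demo):

import pandas as pd

df1 = pd.DataFrame({'post_id': ['FLK1989', 'A1'], 'text': ['old', 'first']})
df2 = pd.DataFrame({'post_id': ['FLK1989', 'B2'], 'text': ['new', 'second']})

result = (pd.concat([df1, df2])
            .drop_duplicates(subset='post_id', keep='last')
            .reset_index(drop=True))
print(result)   # 'FLK1989' keeps only the row coming from df2 ('new')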
Hello, I have an issue converting an entire column of type object to integer.
I have a data frame and I tried to convert some columns that are detected as object into integer (or float), but none of the answers I have already found work for me.
First status
Then I tried to apply the to_numeric method, but it doesn't work.
To numeric method
Then I tried a custom method that you can find here: Pandas: convert dtype 'object' to int
but that doesn't work either: data3['Title'].astype(str).astype(int)
(I cannot post the image anymore - you will have to trust me that it doesn't work.)
I tried to use the inplace argument, but it doesn't seem to be available in those methods.
I am pretty sure the answer is simple, but I cannot find it.
You need to assign the output back:
# it may also work to omit astype(str)
data3['Title'] = data3['Title'].astype(str).astype(int)
Or:
data3['Title'] = pd.to_numeric(data3['Title'])
Sample:
data3 = pd.DataFrame({'Title':['15','12','10']})
print (data3)
Title
0 15
1 12
2 10
print (data3.dtypes)
Title object
dtype: object
data3['Title'] = pd.to_numeric(data3['Title'])
print (data3.dtypes)
Title int64
dtype: object
data3['Title'] = data3['Title'].astype(int)
print (data3.dtypes)
Title int32
dtype: object
As python_enthusiast said, this command works for me too:
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)
but it also works fine with
data3.Title = data3.Title.str.replace(',', '').astype(int)
You have to use .str before .replace in order to get rid of the commas; only then change it to int/float, otherwise you will get an error.
2 years and 11 months later, but here I go.
It's important to check if your data has any spaces, special characters (like commas, dots, or whatever else) first. If yes, then you need to basically remove those and then convert your string data into float and then into an integer (this is what worked for me for the case where my data was numerical values but with commas, like 4,118,662).
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)
You can also try this code; it works fine for me:
data3.Title= pd.factorize(data3.Title)[0]
Version that works with Nulls
With older versions of Pandas there was no NaN for int, but newer versions of pandas offer Int64, which has pd.NA.
So to go from object to int with missing data, you can do this:
df['col'] = df['col'].astype(float)
df['col'] = df['col'].astype('Int64')
By switching to float first you avoid the "object cannot be converted to an IntegerDtype" error.
Note that it is a capital 'I' in Int64.
More info here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
Working with pd.NA
In Pandas 1.0 the new pd.NA datatype was introduced; the goal of pd.NA is to provide a "missing" indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).
With this in mind they created the DataFrame.convert_dtypes() and Series.convert_dtypes() functions, which convert to datatypes that support pd.NA. This is currently considered experimental, but it might well have a bright future.
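A brief sketch of what that can look like with toy data (behaviour as of pandas 1.x; the integer-like float column ends up as nullable Int64 with pd.NA for the missing entry):

import pandas as pd

df = pd.DataFrame({'col': ['1', '2', None]})

# Parse to numeric first (bad entries become NaN), then let pandas pick
# nullable dtypes that support pd.NA.
df['col'] = pd.to_numeric(df['col'], errors='coerce')
df = df.convert_dtypes()

print(df.dtypes)   # col    Int64
print(df)          # the missing value is shown as <NA>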
I had a dataset like this
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79902 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79902 non-null object
1 Video Title 79902 non-null object
2 Video ID 79902 non-null object
3 Video Views 79902 non-null object
4 Comment ID 79902 non-null object
5 cleaned_comments 79902 non-null object
dtypes: object(6)
memory usage: 5.5+ MB
I removed the None/NaN entries using:
dataset = dataset.replace(to_replace='None', value=np.nan).dropna()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null object
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: object(6)
memory usage: 6.1+ MB
Notice the reduced number of entries.
But the Video Views were floats, as shown in dataset.head()
Then I used
dataset['Video Views'] = pd.to_numeric(dataset['Video Views'])
dataset['Video Views'] = dataset['Video Views'].astype(int)
Now,
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null int64
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: int64(1), object(5)
memory usage: 6.1+ MB
I have 4 files which I want to read with Python / Pandas, the files are: https://github.com/kelsey9649/CS8370Group/tree/master/TaFengDataSet
I stripped away the first row (column titles in Chinese) in all 4 files.
But other than that, the 4 files are supposed to have the same format.
Now I want to read them and merge them into one big DataFrame. I tried it with:
pars = {'sep': ';',
        'header': None,
        'names': ['date', 'customer_id', 'age', 'area', 'prod_class', 'prod_id', 'amount', 'asset', 'price'],
        'parse_dates': [0]}
df = pd.DataFrame()
for i in ('01', '02', '12', '11'):
    df = df.append(pd.read_csv(cfg.abspath + 'D' + i, **pars))
BUT: the file D11 gives me different dtypes for the individual columns and thus cannot be merged properly. The file contains over 200k lines, so I cannot easily look for the problem by hand. As mentioned above, I was assuming it has the same format, but obviously there is some small difference.
What's the easiest way to investigate the problem? Obviously, I cannot check every single line in that file...
When I read the 3 working files and merge them, and read D11 independently, the line
A = pd.read_csv(cfg.abspath+'D11',**pars)
still gives me the following warning:
C:\Python27\lib\site-packages\pandas\io\parsers.py:1130: DtypeWarning: Columns (1,4,5,6,7,8) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
Using the method .info() in pandas (for A and df) yields:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 594119 entries, 0 to 178215
Data columns (total 9 columns):
date 594119 non-null datetime64[ns]
customer_id 594119 non-null int64
age 594119 non-null object
area 594119 non-null object
prod_class 594119 non-null int64
prod_id 594119 non-null int64
amount 594119 non-null int64
asset 594119 non-null int64
price 594119 non-null int64
dtypes: datetime64[ns](1), int64(6), object(2)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 223623 entries, 0 to 223622
Data columns (total 9 columns):
date 223623 non-null object
customer_id 223623 non-null object
age 223623 non-null object
area 223623 non-null object
prod_class 223623 non-null object
prod_id 223623 non-null object
amount 223623 non-null object
asset 223623 non-null object
price 223623 non-null object
Even if I used the dtype option on import, I would still be worried about wrong or bad results, since some incorrect casting of datatypes might happen during the import.
How can I overcome and solve this issue?
Thanks a lot
Whenever you have a problem that is too boring to be done by hand, the solution is to write a program:
for col in ('age', 'area'):
    for i, val in enumerate(A[col]):
        try:
            int(val)
        except:
            print('Line {}: {} = {}'.format(i, col, val))
This will show you all the lines in the file with non-integer values in the age and area columns. This is the first step in debugging the problem. Once you know what the problematic values are, you can better decide how to deal with them -- maybe by pre-processing (cleaning) the data file, or by using some pandas code to select and fix the problematic values.
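A variant of the same idea, sketched under the assumption that A is the D11 frame read as in the question: pd.to_numeric with errors='coerce' flags the same offending rows without an explicit try/except loop:

import pandas as pd

for col in ('age', 'area'):
    converted = pd.to_numeric(A[col], errors='coerce')
    bad = A.loc[converted.isna(), col]       # rows that failed the conversion
    print('{}: {} problem rows'.format(col, len(bad)))
    print(bad.head())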