pandas - write to csv only if column contains certain values - python

I have a pandas df that looks like this:
df = pd.DataFrame([[1,'hello,bye','USA','3/20/2016 7:00:17 AM'],[2,'good morning','UK','3/20/2016 7:00:20 AM']],columns=['id','text','country','datetime'])
id text country datetime
0 1 hello,bye USA 3/20/2016 7:00:17 AM
1 2 good morning UK 3/20/2016 7:00:20 AM
I want to print this output to csv but only if the country column contains 'USA'.
This is what I tried:
if 'USA' in df.country.values:
    df.to_csv('test.csv')
but it still writes the entire df to the test.csv file.

Here is the problem: 'USA' in df.country.values only checks whether any row contains 'USA'; when that is true, to_csv still writes the whole DataFrame. Filter the rows with a boolean mask first, then write only the filtered frame:
df = pd.DataFrame([[1,'hello,bye','USA','3/20/2016 7:00:17 AM'],[2,'good morning','UK','3/20/2016 7:00:20 AM']],columns=['id','text','country','datetime'])
usa_rows = df[df['country'] == 'USA']
if not usa_rows.empty:
    usa_rows.to_csv('test.csv')
Hope this helps you :)
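If partial matches matter (for example a country field like 'USA/Canada'), a boolean mask built with str.contains can be used instead of exact equality. A sketch, with 'usa_rows.csv' as a hypothetical output name:

```python
import pandas as pd

df = pd.DataFrame([[1, 'hello,bye', 'USA', '3/20/2016 7:00:17 AM'],
                   [2, 'good morning', 'UK', '3/20/2016 7:00:20 AM']],
                  columns=['id', 'text', 'country', 'datetime'])

# keep only rows whose country mentions 'USA', then write just those rows
usa = df[df['country'].str.contains('USA')]
usa.to_csv('usa_rows.csv', index=False)
```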

Related

How to join columns in CSV files using Pandas in Python

I have a CSV file that looks something like this:
# data.csv (this line is not there in the file)
Names, Age, Names
John, 5, Jane
Rian, 29, Rath
And when I read it through Pandas in Python I get something like this:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
And the output of the program is:
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
Is there any way to get:
Names Age
0 John 5
1 Rian 29
2 Jane
3 Rath
First, I'd suggest having unique names for each column. Either go into the csv file and change the name of a column header or do so in pandas.
Using 'Names2' as the header of the column with the second occurrence of the same column name, try this:
Starting from
datalist = [['John', 5, 'Jane'], ['Rian', 29, 'Rath']]
df = pd.DataFrame(datalist, columns=['Names', 'Age', 'Names2'])
We have
Names Age Names2
0 John 5 Jane
1 Rian 29 Rath
So, use:
dff = (pd.concat([df['Names'].append(df['Names2']).reset_index(drop=True),
                  df.iloc[:, 1]],
                 ignore_index=True, axis=1)
         .fillna('')
         .rename(columns=dict(enumerate(['Names', 'Ages']))))
to get your desired result.
From the inside out:
Series.append stacks the two name columns into a single column.
pd.concat( ... ) lines that stacked column up against the Age column; rows with no matching age become NaN, which fillna('') blanks out, and rename relabels columns 0 and 1 as 'Names' and 'Ages'.
To discover what the other commands do, I suggest removing them one-by-one and looking at the results.
The call is split across several lines purely for readability.
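Note that Series.append was deprecated and then removed in pandas 2.0, so on recent versions the same stacking can be done with pd.concat. A sketch of the equivalent:

```python
import pandas as pd

datalist = [['John', 5, 'Jane'], ['Rian', 29, 'Rath']]
df = pd.DataFrame(datalist, columns=['Names', 'Age', 'Names2'])

# stack the two name columns into one, then line the ages up beside them
names = pd.concat([df['Names'], df['Names2']], ignore_index=True)
dff = pd.concat([names, df['Age']], axis=1).fillna('')
dff.columns = ['Names', 'Ages']
print(dff)
```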
You can use:
usecols, which reads only the selected columns.
low_memory=False, which makes pandas read the file in a single pass instead of in internal chunks, avoiding mixed-dtype guessing.
import pandas as pd
data = pd.read_csv("data.csv", usecols=['Names', 'Age'], low_memory=False)
print(data)
Please make sure the column names in your csv are unique.

Delete word from column in python

I have a CSV file which has four columns, like this:
Freq ID Date Name
0 2053 1998 apple
2 2054 1998 May-June. orange
3 2055 2019 apple
5 2056 1999 Oct-Nov orange
It is a large file. I have to remove 'May-June.' from the Date column, and wherever a year appears together with months I have to keep only the year.
How can I remove it using Python?
You can use pandas to read the file and extract the year from the Date column. Use the split() function to split on the space; the first item will be your year.
Like this:
import pandas as pd
df = pd.read_csv(filename)
df['Date'] = df["Date"].str.split(" ").str.get(0)
print(df)
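If a stray value could ever start with something other than the year, extracting the first four-digit run is slightly more robust than splitting on a space. A sketch on made-up sample values:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1998', '1998 May-June.', '2019', '1999 Oct-Nov']})
# pull out the first four-digit run; expand=False returns a Series
df['Date'] = df['Date'].str.extract(r'(\d{4})', expand=False)
print(df)
```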
Hi, I recently came across a similar kind of problem and resolved it with the snippet below. Give it a try; I think it works well here.
import pandas as pd
import csv
from datetime import datetime
# keep only the leading four-digit year from each Date value
to_datetime = lambda d: datetime.strptime(d[:4], '%Y')
path = r"D:\python_poc"
filename = r"\Input.csv"
df = pd.read_csv(path + filename, converters={'Date': to_datetime})
df.to_csv(path + filename, index=False, quoting=csv.QUOTE_ALL)

How to concatenate sum on apply function and print dataframe as a table format within a file

I am trying to concatenate the 'count' value into the top row of my dataframe.
Here is an example of my starting data:
Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5
df = pd.DataFrame(data, columns=['Name', 'IP', 'Application', 'Count'])
df_new = df.groupby(['Name', 'IP'])['Count'].apply(lambda x:x.astype(int).sum())
If I print df_new this produces the following output:
Name,IP,Application,Count
Tom,100.100.100,MsWord,15
................Excel,15
Fred,200.200.200,MsWord,6
................Python,6
As you can see, the count has correctly been calculated, for Tom it has added 5 to 10 and got an output of 15. However, this is displayed on every row of the group.
Is there any way to get the output as follows - so the count is only on the first line of the group:
Name,IP,Application,Count
Tom,100.100.100,MsWord,15
.................Excel
Fred,200.200.200,MsWord,6
.................Python
Is there any way to write df_new to a file in this nice format?
I would like the output to appear like a table, almost like an Excel sheet with merged cells.
I have tried df_new.to_csv('path'), but this removes the nice formatting I see when I print df_new to the console.
It is a bit of a challenge to have a DataFrame carry its own summary rows. A DataFrame lends itself to results that do not depend on position (unlike, say, "the last item in a group"). It can be done, but it is better to keep those concerns separate.
import pandas as pd
from io import StringIO
data = StringIO("""Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5""")
#df = pd.DataFrame(data, columns=['Name', 'IP', 'Application', 'Count'])
#df_new = df.groupby(['Name', 'IP', 'Application'])['Count'].apply(lambda x:x.astype(int).sum())
df = pd.read_csv(data)
new_df = df.groupby(['Name', 'IP']).sum()
# reset the two-level row index resulting from the groupby()
new_df.reset_index(inplace=True)
df.set_index(['Name', 'IP'], inplace=True)
new_df.set_index(['Name', 'IP'], inplace=True)
print(df)
Application Count
Name IP
Tom 100.100.100 MsWord 5
100.100.100 Excel 10
Fred 200.200.200 Python 1
200.200.200 MsWord 5
print(new_df)
Count
Name IP
Fred 200.200.200 6
Tom 100.100.100 15
print(new_df.join(df, lsuffix='_lsuffix', rsuffix='_rsuffix'))
Count_lsuffix Application Count_rsuffix
Name IP
Fred 200.200.200 6 Python 1
200.200.200 6 MsWord 5
Tom 100.100.100 15 MsWord 5
100.100.100 15 Excel 10
From here, you can use the multiindex to access the sum of the groups.
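If the goal is simply to get the console-style layout into a file, DataFrame.to_string preserves the aligned formatting that to_csv discards. A sketch, with 'report.txt' as a hypothetical output path:

```python
import pandas as pd
from io import StringIO

data = StringIO("""Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5""")

df = pd.read_csv(data)
# per-group totals, keeping the original group order
totals = df.groupby(['Name', 'IP'], sort=False)['Count'].sum()
joined = totals.to_frame('Total').join(df.set_index(['Name', 'IP']))

# to_string() keeps the aligned console layout that to_csv() throws away
with open('report.txt', 'w') as f:
    f.write(joined.to_string())
```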

Shift down one row then rename the column

My data is looking like this:
pd.read_csv('/Users/admin/desktop/007538839.csv').head()
105586.18
0 105582.910
1 105585.230
2 105576.445
3 105580.016
4 105580.266
I want to move that 105586.18 to the 0 index, because right now it is the column name. After that I want to name this column 'flux'. I've tried
pd.read_csv('/Users/admin/desktop/007538839.csv', sep='\t', names = ["flux"])
but it did not work, probably because the dataframe is not in the right format.
How can I achieve that?
Your code works nicely for me:
import pandas as pd
temp=u"""105586.18
105582.910
105585.230
105576.445
105580.016
105580.266"""
#after testing, replace 'StringIO(temp)' with '/Users/admin/desktop/007538839.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp), sep='\t', names=["flux"])
print (df)
flux
0 105586.180
1 105582.910
2 105585.230
3 105576.445
4 105580.016
5 105580.266
For overwrite original file with same data with new header flux:
df.to_csv('/Users/admin/desktop/007538839.csv', index=False)
Try this:
df = pd.read_csv('/Users/admin/desktop/007538839.csv', header=None)
df.columns = ['flux']
header=None is your friend here: it stops pandas from treating the first value as the column name.
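Both steps can also be folded into the read itself, since header=None and names can be combined in one call. A sketch on inline data:

```python
import pandas as pd
from io import StringIO

temp = """105586.18
105582.910
105585.230"""

# header=None keeps the first value as data; names= labels the column
df = pd.read_csv(StringIO(temp), header=None, names=['flux'])
print(df)
```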

Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

Similar to this question, but my CSV has a slightly different format. Here is an example:
id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
I think the double quotation mark in the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:
id employee details createdAt Unnamed: 1 Unnamed: 2
1 John {Country":"USA" Salary:5000 Review:null}" 2018-09-01
2 Sarah {Country":"Australia" Salary:6000 Review:"Hardworking"}" 2018-09-05
My desired output:
id employee details createdAt
1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
2 Sarah {"Country":"Australia","Salary":6000,"Review":"Hardworking"} 2018-09-05
I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?
As an alternative approach, you could read the file in manually, parse each row correctly, and use the resulting data to construct the dataframe. This works by splitting each row from both the front and the back to get the non-problematic columns, then taking the remaining middle part:
import pandas as pd

data = []
with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        split = row.split(',', 2)
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)

df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This would display your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
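Since the parsed details values are now valid JSON, they can be loaded into Python dicts where needed. A minimal sketch on one sample value:

```python
import json

detail = '{"Country":"USA","Salary":5000,"Review":null}'
parsed = json.loads(detail)  # JSON null becomes Python None
print(parsed['Country'], parsed['Salary'])
```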
I have reproduced your file
With
df = pd.read_csv('e1.csv', index_col=None )
print (df)
Output
id emp details createdat
0 1 john "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.
Here is a less than optimal option:
df = pd.read_csv('s083838383.csv', sep='##$%^', engine='python')
header = df.columns[0]
print(df)
Why sep='##$%^'? It is just a junk string that never occurs in the file, so each whole line is read as a single field; it could be any random sequence and is only used as a means to import the data into a df object to work with.
df looks like this:
id,employee,details,createdAt
0 1,John,"{"Country":"USA","Salary":5000,"Review...
1 2,Sarah,"{"Country":"Australia", "Salary":6000...
Then you could use str.extract to apply regex and expand the columns:
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)
result.columns = header.strip().split(',')
print(result)
result is:
id employee details createdAt
0 1 John "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 Sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
If you need the starting and ending quotes stripped off of the details string values, you could do:
result['details'] = result['details'].str.strip('"')
If the details object items needs to be a dicts instead of strings, you could do:
from json import loads
result['details'] = result['details'].apply(loads)
