Intersection of rows from text files using python

I have a number of text files which contain both text and numerical entries. More specifically, the first two rows and the first column of each of these files are text, and all other entries are numbers. As an example, let's consider three separate text files (files A, B and C) in the formats shown in the pictures below.



(picture of file A)

(picture of file B)

(picture of file C)



Text files for the above pictures can be found here (file_A), here (file_B) and here (file_C).



I want to use python to create a separate text file which contains only those rows (starting from row 3) which are present in all three files (i.e. the intersection of rows). This final text file should keep the relevant columns from all three files. The following is a picture of the final file that I want: (picture of the final file). (A text file for this final file can be found here.)



I realize that I can use pandas to import each of the mentioned text files using a command like the following:



df = pd.read_csv('filename.txt', sep='\t', header=[0,1], index_col=0)
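For concreteness, loading the three example files this way might look like the following sketch (the .txt file names are assumptions based on the naming in the post; the actual files are only linked above):

import pandas as pd

# hypothetical file names, following the naming used in the post
A = pd.read_csv('file_A.txt', sep='\t', header=[0,1], index_col=0)
B = pd.read_csv('file_B.txt', sep='\t', header=[0,1], index_col=0)
C = pd.read_csv('file_C.txt', sep='\t', header=[0,1], index_col=0)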


How can I use python to make this final file once I have files A, B and C? I would like to use python 2.7 for this.










python python-2.7 pandas






asked 1 hour ago by Ling Guo
  • So concatenating didn't help? i.e. pd.concat([A,B,C], axis=1).dropna(), with dropna to remove the rows with missing values.
    – Dark
    1 hour ago
2 Answers






Use pd.merge() -



import pandas as pd
from functools import reduce  # reduce is also a builtin in Python 2.7
dfs = [df1, df2, df3]
# successive inner merges on the index keep only the rows present in every frame
df_final = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True), dfs)


OR



df1.join(df2, how='inner').join(df3, how='inner')  # chained inner joins on the index


This gives you a foolproof way of joining by keys -



Type        A1     A2     A3     A4     A5     B1     B2     C1     C2     C3     C4
Tag      TagA1  TagA1  TagA1  TagA2  TagA2  TagB1  TagB2  TagC1  TagC1  TagC2  TagC2
object1    1.1    2.1    3.1    4.1    5.1   11.1   12.1   21.1   22.1   23.1   24.1
object4    1.4    2.4    3.4    4.4    5.4   11.4   12.4   21.4   22.4   23.4   24.4
object7    1.7    2.7    3.7    4.7    5.7   11.7   12.7   21.7   22.7   23.7   24.7
object9    1.9    2.9    3.9    4.9    5.9   11.9   12.9   21.9   22.9   23.9   24.9


Timings



@Dark: 2.63 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Vivek: 2.9 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Vivek (2nd, using join): 5.4 ms ± 662 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Since there is no hassle of matching keys, @Dark's solution is slightly more efficient. But since the OP asked for only those rows (starting from row 3) which are present in all three files, I went the merge way rather than the concat way.
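For completeness, a minimal sketch of writing the merged result back out as a tab-separated text file, which is the final artifact the OP asked for (the output name final.txt is an assumption):

# with header=[0,1] columns, to_csv writes both header rows plus the index column
df_final.to_csv('final.txt', sep='\t')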






answered 1 hour ago, edited 58 mins ago by Vivek Kalyanarangan
Use concat and dropna. dropna will remove the rows with missing values, so you will be left with only the rows that are found in all the files, i.e.

# the outer concat aligns rows on the index; dropna then keeps only complete rows
df_m = pd.concat([A, B, C], axis=1).dropna()

Type        A1     A2     A3     A4     A5     B1     B2     C1     C2     C3     C4
Tag      TagA1  TagA1  TagA1  TagA2  TagA2  TagB1  TagB2  TagC1  TagC1  TagC2  TagC2
object1    1.1    2.1    3.1    4.1    5.1   11.1   12.1   21.1   22.1   23.1   24.1
object4    1.4    2.4    3.4    4.4    5.4   11.4   12.4   21.4   22.4   23.4   24.4
object7    1.7    2.7    3.7    4.7    5.7   11.7   12.7   21.7   22.7   23.7   24.7
object9    1.9    2.9    3.9    4.9    5.9   11.9   12.9   21.9   22.9   23.9   24.9

This solution holds good if you can confirm that there are no NaNs in any of the dataframes. If the data may contain NaNs, you can go for @Vivek Kalyanarangan's solution instead.






answered 1 hour ago, edited 18 mins ago by Dark
  • Don't you think the dropna() is a little risky? What if the Type/Tag is common but the data contains NaN? I think in this case the OP would like to retain those rows.
    – Vivek Kalyanarangan
    53 mins ago

  • @VivekKalyanarangan I didn't think of that. Well, this is what I could come up with when I saw the question :). Perhaps filling the NaNs of each dataframe with some arbitrary value and then dropping the NaNs would be the solution.
    – Dark
    23 mins ago
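A minimal sketch of that fill-then-drop idea from the comment, assuming A, B and C are the dataframes loaded earlier (the sentinel value -999 is an arbitrary assumption; any value that cannot occur in the real data works):

import numpy as np
import pandas as pd

sentinel = -999  # hypothetical placeholder that must not appear in the data
# Filling genuine NaNs first means dropna() only removes rows that are
# absent from one of the files, not rows whose data happens to contain NaN.
df_m = pd.concat([df.fillna(sentinel) for df in (A, B, C)], axis=1).dropna()
df_m = df_m.replace(sentinel, np.nan)  # restore the original NaNs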














